Percentile
[1]:
import xarray
import climtas
import dask.array
import pandas
import numpy
We have hourly input data for a year, and we want to calculate the 90th percentile along the time axis at each grid point:
[2]:
time = pandas.date_range('20010101', '20020101', freq='H', inclusive='left')
data = dask.array.random.random((len(time),50,100), chunks=(24*60,25,25))
lat = numpy.linspace(-90, 90, data.shape[1])
lon = numpy.linspace(-180, 180, data.shape[2], endpoint=False)
da = xarray.DataArray(data, coords=[('time', time), ('lat', lat), ('lon', lon)], name='temperature')
da
[2]:
<xarray.DataArray 'temperature' (time: 8760, lat: 50, lon: 100)>
dask.array<random_sample, shape=(8760, 50, 100), dtype=float64, chunksize=(1440, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 ... 2001-12-31T23:00:00
  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0
  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
The Xarray way is to use xarray.DataArray.quantile; however, for a dataset chunked along the time axis this raises an error:
[3]:
try:
    da.quantile(0.9, 'time')
except Exception as e:
    print('Error:', e)
Error: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.
To get this to work we must rechunk the data so it is no longer chunked along the time axis. This is a very expensive operation for large datasets, where variables are often split across multiple files for different years.
[4]:
da.chunk({'time': None}).quantile(0.9, 'time')
[4]:
<xarray.DataArray 'temperature' (lat: 50, lon: 100)>
dask.array<getitem, shape=(50, 100), dtype=float64, chunksize=(25, 25), chunktype=numpy.ndarray>
Coordinates:
  * lat       (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0
  * lon       (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
    quantile  float64 0.9
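The cost comes from the rechunk step: every chunk along time must be combined before a single quantile can be computed. A minimal standalone sketch of the chunk layouts involved, independent of the notebook's `da` (the shapes mirror its time axis):

```python
import dask.array

# A year of hourly steps in 60-day chunks, as in the notebook's time axis
a = dask.array.random.random((8760,), chunks=1440)
print(a.chunks)  # six full 1440-element chunks plus a 120-element remainder

# A single chunk along time, as quantile requires; producing it forces
# every time chunk to be read and concatenated
b = a.rechunk(-1)
print(b.chunks)  # one 8760-element chunk
```

For a multi-file, multi-terabyte dataset this concatenation dominates the runtime, which is what motivates the approximate approach below.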
Dask has an approximate percentile operation that works on data chunked along time; however, it only runs on one-dimensional data:
[5]:
dask.array.percentile(da.data[:, 30, 60], 90)
climtas.blocked.approx_percentile extends the Dask parallel approximate percentile calculation to multi-dimensional datasets, so large datasets don’t need to be rechunked.
Note that this is an approximation to the true percentile; check that it behaves appropriately for your dataset by comparing approx_percentile and numpy.percentile on a subset of the data.
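Such a check might look like the following sketch, which uses Dask's one-dimensional dask.array.percentile on a synthetic series standing in for a single grid point (the variable names are illustrative, not part of climtas):

```python
import numpy
import dask.array

# Synthetic stand-in for one grid point's hourly time series
rng = numpy.random.default_rng(42)
sample = rng.random(8760)

# Chunked as the notebook's data is: 1440 steps per chunk
darr = dask.array.from_array(sample, chunks=1440)

# Approximate percentile, merged chunk by chunk
approx = float(dask.array.percentile(darr, 90).compute()[0])

# Exact percentile over the whole series
exact = float(numpy.percentile(sample, 90))

print(approx, exact)
```

If the two values differ by more than your analysis can tolerate, use the exact (rechunked) calculation instead.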
[6]:
climtas.approx_percentile(da, 90, 'time')
[6]:
<xarray.DataArray 'temperature' (percentile: 1, lat: 50, lon: 100)>
dask.array<_merge_approx_percentile, shape=(1, 50, 100), dtype=float64, chunksize=(1, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * lat         (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0
  * lon         (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
  * percentile  (percentile) int64 90
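The result above keeps a length-one `percentile` dimension. If a plain 2-D field is wanted, that dimension can be squeezed out with standard xarray operations; a minimal sketch using a stand-in array of the same shape (the real call would squeeze climtas's output the same way):

```python
import numpy
import xarray

# Stand-in for the approx_percentile result: (percentile, lat, lon)
pctl = xarray.DataArray(
    numpy.zeros((1, 50, 100)),
    dims=['percentile', 'lat', 'lon'],
    coords={'percentile': [90]},
    name='temperature',
)

# Drop the length-one percentile dimension and its coordinate
field = pctl.squeeze('percentile', drop=True)
print(field.dims)  # ('lat', 'lon')
```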