Percentile#

[1]:
import xarray
import climtas
import dask.array
import pandas
import numpy

We have hourly input data for a year, and we want to calculate the 90th percentile along the time axis for each grid point

[2]:
time = pandas.date_range('20010101', '20020101', freq='H', inclusive='left')

data = dask.array.random.random((len(time),50,100), chunks=(24*60,25,25))
lat = numpy.linspace(-90, 90, data.shape[1])
lon = numpy.linspace(-180, 180, data.shape[2], endpoint=False)

da = xarray.DataArray(data, coords=[('time', time), ('lat', lat), ('lon', lon)], name='temperature')
da
[2]:
<xarray.DataArray 'temperature' (time: 8760, lat: 50, lon: 100)>
dask.array<random_sample, shape=(8760, 50, 100), dtype=float64, chunksize=(1440, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 ... 2001-12-31T23:00:00
  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0
  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4

The standard Xarray approach is to use xarray.DataArray.quantile; however, for a dataset chunked along the time axis this raises an error:

[3]:
try:
    da.quantile(0.9, 'time')
except Exception as e:
    print('Error:', e)
Error: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.

To get this to work we must rechunk the data so that it is no longer chunked along the time axis, a very expensive operation for large datasets, where variables are typically split across multiple files for different years.

[4]:
da.chunk({'time': None}).quantile(0.9, 'time')
[4]:
<xarray.DataArray 'temperature' (lat: 50, lon: 100)>
dask.array<getitem, shape=(50, 100), dtype=float64, chunksize=(25, 25), chunktype=numpy.ndarray>
Coordinates:
  * lat       (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0
  * lon       (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
    quantile  float64 0.9

Dask has an approximate percentile operation that works on data chunked along time; however, it only runs on one-dimensional data

[5]:
dask.array.percentile(da.data[:, 30, 60], 90)
[5]:
        Array     Chunk
Bytes   8 B       8 B
Shape   (1,)      (1,)
Count   71 Tasks  1 Chunks
Type    float64   numpy.ndarray

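To see the idea behind a chunked approximation, here is a minimal numpy-only sketch: each chunk is summarised by its percentiles at a fixed grid of quantile levels, and the summaries are pooled before the percentile is taken again. This is a simplified stand-in for illustration, not Dask's actual merge algorithm, and it assumes roughly equal chunk sizes so the pooled summaries are fairly weighted.

```python
import numpy as np

def approx_percentile_1d(values, q, n_chunks, n_levels=100):
    # Fixed grid of quantile levels used to summarise each chunk
    levels = np.linspace(0, 100, n_levels)
    # Split into near-equal chunks, as a stand-in for dask chunks
    chunks = np.array_split(values, n_chunks)
    # Summarise each chunk independently by its percentiles
    summaries = [np.percentile(c, levels) for c in chunks]
    # Merge step: pool the per-chunk summaries and take the
    # requested percentile of the pooled summary
    return np.percentile(np.concatenate(summaries), q)

rng = np.random.default_rng(0)
x = rng.random(8760)

approx = approx_percentile_1d(x, 90, n_chunks=6)
exact = np.percentile(x, 90)
print(approx, exact)
```

Because each chunk is reduced to `n_levels` numbers regardless of its length, the merge only ever touches small summaries, which is what makes this kind of approach cheap on chunked data.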
climtas.blocked.approx_percentile extends Dask's parallel approximate percentile calculation to multi-dimensional datasets, so large datasets do not need to be rechunked.

Note that this is an approximation to the true percentile; check that it behaves appropriately for your dataset by comparing approx_percentile against numpy.percentile on a subset of the data.
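As a sketch of what that check might look like, the snippet below compares the approximate and exact percentiles for a single time series. It uses dask.array.percentile directly so it does not depend on climtas; with climtas installed you would instead compare climtas.approx_percentile of the array against numpy.percentile of a loaded subset. The tolerance here is an arbitrary example value; choose one that suits your application.

```python
import numpy as np
import dask.array

# A small chunked sample standing in for one grid point's time series
rng = np.random.default_rng(42)
series = rng.random(8760)
chunked = dask.array.from_array(series, chunks=1440)

# Approximate percentile over chunked data vs the exact value
approx = dask.array.percentile(chunked, 90).compute()[0]
exact = np.percentile(series, 90)

# Example tolerance check; tighten or loosen to suit your data
assert abs(approx - exact) < 0.05
print(approx, exact)
```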

[6]:
climtas.approx_percentile(da, 90, 'time')
[6]:
<xarray.DataArray 'temperature' (percentile: 1, lat: 50, lon: 100)>
dask.array<_merge_approx_percentile, shape=(1, 50, 100), dtype=float64, chunksize=(1, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * lat         (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0
  * lon         (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
  * percentile  (percentile) int64 90