Percentile#

[1]:
import xarray
import climtas
import dask.array
import pandas
import numpy

We have hourly input data for a year, and we want to calculate the 90th percentile along the time axis for each grid point

[2]:
time = pandas.date_range('20010101', '20020101', freq='H', inclusive='left')

data = dask.array.random.random((len(time),50,100), chunks=(24*60,25,25))
lat = numpy.linspace(-90, 90, data.shape[1])
lon = numpy.linspace(-180, 180, data.shape[2], endpoint=False)

da = xarray.DataArray(data, coords=[('time', time), ('lat', lat), ('lon', lon)], name='temperature')
da
[2]:
<xarray.DataArray 'temperature' (time: 8760, lat: 50, lon: 100)>
dask.array<random_sample, shape=(8760, 50, 100), dtype=float64, chunksize=(1440, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 ... 2001-12-31T23:00:00
  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0
  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4

The standard Xarray approach is to use xarray.DataArray.quantile; however, for a dataset chunked along the time axis this raises an error:

[3]:
try:
    da.quantile(0.9, 'time')
except Exception as e:
    print('Error:', e)
Error: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.

To get this to work we must rechunk the data so that it is no longer chunked along the time axis, a very expensive operation for large datasets, where variables are typically split across multiple files for different years.

[4]:
da.chunk({'time': None}).quantile(0.9, 'time')
[4]:
<xarray.DataArray 'temperature' (lat: 50, lon: 100)>
dask.array<getitem, shape=(50, 100), dtype=float64, chunksize=(25, 25), chunktype=numpy.ndarray>
Coordinates:
  * lat       (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0
  * lon       (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
    quantile  float64 0.9

Dask has an approximate percentile operation that works on data chunked along time; however, it only runs on one-dimensional data

[5]:
dask.array.percentile(da.data[:, 30, 60], 90)
[5]:
        Array     Chunk
Bytes   8 B       8 B
Shape   (1,)      (1,)
Count   71 Tasks  1 Chunks
Type    float64   numpy.ndarray

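To see the idea behind a chunked approximation, here is a minimal numpy-only sketch: each chunk is summarised by its percentiles at a fixed grid of quantile levels, and the summaries are pooled before the percentile is taken again. This is a simplified stand-in for illustration, not Dask's actual merge algorithm, and it assumes roughly equal chunk sizes so the pooled summaries are fairly weighted.

```python
import numpy as np

def approx_percentile_1d(values, q, n_chunks, n_levels=100):
    # Fixed grid of quantile levels used to summarise each chunk
    levels = np.linspace(0, 100, n_levels)
    # Split into near-equal chunks, as a stand-in for dask chunks
    chunks = np.array_split(values, n_chunks)
    # Summarise each chunk independently by its percentiles
    summaries = [np.percentile(c, levels) for c in chunks]
    # Merge step: pool the per-chunk summaries and take the
    # requested percentile of the pooled summary
    return np.percentile(np.concatenate(summaries), q)

rng = np.random.default_rng(0)
x = rng.random(8760)

approx = approx_percentile_1d(x, 90, n_chunks=6)
exact = np.percentile(x, 90)
print(approx, exact)
```

Because each chunk is reduced to `n_levels` numbers regardless of its length, the merge only ever touches small summaries, which is what makes this kind of approach cheap on chunked data.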
climtas.blocked.approx_percentile extends Dask's parallel approximate percentile calculation to multi-dimensional datasets, so large datasets do not need to be rechunked.

Note that this is an approximation to the true percentile; check that it behaves appropriately for your dataset by comparing approx_percentile against numpy.percentile on a subset of the data.
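As a sketch of what that check might look like, the snippet below compares the approximate and exact percentiles for a single time series. It uses dask.array.percentile directly so it does not depend on climtas; with climtas installed you would instead compare climtas.approx_percentile of the array against numpy.percentile of a loaded subset. The tolerance here is an arbitrary example value; choose one that suits your application.

```python
import numpy as np
import dask.array

# A small chunked sample standing in for one grid point's time series
rng = np.random.default_rng(42)
series = rng.random(8760)
chunked = dask.array.from_array(series, chunks=1440)

# Approximate percentile over chunked data vs the exact value
approx = dask.array.percentile(chunked, 90).compute()[0]
exact = np.percentile(series, 90)

# Example tolerance check; tighten or loosen to suit your data
assert abs(approx - exact) < 0.05
print(approx, exact)
```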

[6]:
climtas.approx_percentile(da, 90, 'time')
[6]:
<xarray.DataArray 'temperature' (percentile: 1, lat: 50, lon: 100)>
dask.array<_merge_approx_percentile, shape=(1, 50, 100), dtype=float64, chunksize=(1, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * lat         (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0
  * lon         (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
  * percentile  (percentile) int64 90