Resample

[1]:
import xarray
import climtas
import dask.array
import pandas
import numpy

Say we have hourly input data for a year that we want to convert to daily data.

[2]:
# 'closed' was renamed to 'inclusive' in recent pandas versions
time = pandas.date_range('20010101', '20020101', freq='H', inclusive='left')

data = dask.array.random.random((len(time),50,100), chunks=(24*60,25,25))
lat = numpy.linspace(-90, 90, data.shape[1])
lon = numpy.linspace(-180, 180, data.shape[2], endpoint=False)

da = xarray.DataArray(data, coords=[('time', time), ('lat', lat), ('lon', lon)], name='temperature')
da
[2]:
<xarray.DataArray 'temperature' (time: 8760, lat: 50, lon: 100)>
dask.array<random_sample, shape=(8760, 50, 100), dtype=float64, chunksize=(1440, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 ... 2001-12-31T23:00:00
  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0
  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
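
It's worth checking the chunk layout Dask has chosen before going further. A quick sketch - the counts follow from the chunk sizes shown above (7 time chunks x 2 lat chunks x 4 lon chunks):

[ ]:
# 7 x 2 x 4 = 56 chunks in total, matching the 56 tasks in the starting graph
print(da.data.chunks)
print(da.data.npartitions)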

The idiomatic Xarray way is to use xarray.DataArray.resample; however, that is an expensive function to run - we started with 56 tasks and 56 chunks in the Dask graph, and this has exploded to 11,736 tasks and 2,920 chunks. For a large dataset this increase in chunk count really bogs down Dask.

The reason for this is that with resample Xarray creates a new output chunk for each individual day - you can see the chunk size of the output is now (1, 25, 25).

[3]:
da.resample(time='D').mean()
[3]:
<xarray.DataArray 'temperature' (time: 365, lat: 50, lon: 100)>
dask.array<stack, shape=(365, 50, 100), dtype=float64, chunksize=(1, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 2001-01-02 ... 2001-12-31
  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0
  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
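
You can count the tasks and chunks yourself - a sketch, noting that exact task counts vary between xarray and dask versions:

[ ]:
resampled = da.resample(time='D').mean()

# Total tasks in the Dask graph and total chunks, before and after
print(len(da.data.dask), da.data.npartitions)
print(len(resampled.data.dask), resampled.data.npartitions)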

A better way is to use xarray.DataArray.coarsen to do the resampling. This keeps the original number of chunks, but has the drawback that it's not aware of the time axis: you need to specify that it should be reduced by 24 samples. It also won't complain if the time axis is uneven; however, for most well-behaved datasets this shouldn't be an issue.

[4]:
da.coarsen(time=24).mean()
[4]:
<xarray.DataArray 'temperature' (time: 365, lat: 50, lon: 100)>
dask.array<mean_agg-aggregate, shape=(365, 50, 100), dtype=float64, chunksize=(60, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01T11:30:00 ... 2001-12-31T11:30:00
  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0
  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
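
Note that coarsen also reduces the time coordinate by averaging it, which is why the daily timestamps above land at 11:30 rather than midnight. If you'd rather not hard-code the factor of 24, you can derive it from the spacing of the time coordinate - a minimal sketch assuming a regular hourly axis:

[ ]:
# Derive the reduction factor from the time step
# (assumes an evenly spaced axis - coarsen itself won't verify this)
step = da.time.values[1] - da.time.values[0]    # numpy.timedelta64, 1 hour here
factor = int(numpy.timedelta64(1, 'D') / step)  # 24 samples per day
da.coarsen(time=factor).mean()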

climtas.blocked.blocked_resample works the same way as coarsen, giving you the same number of chunks as you started with, but it is also time-axis aware: it checks that the time axis is evenly spaced, and you can use Pandas time interval names instead of a sample count.

[5]:
climtas.blocked_resample(da, time='D').mean()
[5]:
<xarray.DataArray 'temperature' (time: 365, lat: 50, lon: 100)>
dask.array<resample_op, shape=(365, 50, 100), dtype=float64, chunksize=(60, 25, 25), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 2001-01-02 ... 2001-12-31
  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0
  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
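
Since blocked_resample accepts either a Pandas interval name or a sample count, the two forms below should be equivalent for this hourly dataset - a sketch based on the description above, so check the climtas documentation for the exact API:

[ ]:
# Assumed equivalent for hourly data: one day is 24 samples
daily = climtas.blocked_resample(da, time='D').mean()
also_daily = climtas.blocked_resample(da, time=24).mean()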