climtas.blocked

Xarray operations that act per Dask block

groupby

climtas.blocked.blocked_groupby(da: xarray.core.dataarray.DataArray, indexer=None, **kwargs) → climtas.blocked.BlockedGroupby

Create a blocked groupby

Mostly works like xarray.DataArray.groupby(), but with better chunking behaviour, at the expense of only working with data regularly spaced in time.

grouping may be one of:

  • ‘dayofyear’: Group by number of days since the start of the year

  • ‘monthday’: Group by (‘month’, ‘day’)

>>> time = pandas.date_range('20020101', '20050101', freq='D', inclusive='left')
>>> daily = xarray.DataArray(numpy.random.random(time.size), coords=[('time', time)])
>>> blocked_doy_max = blocked_groupby(daily, time='dayofyear').max()
>>> xarray_doy_max = daily.groupby('time.dayofyear').max()
>>> xarray.testing.assert_equal(blocked_doy_max, xarray_doy_max)
Parameters
  • da (xarray.DataArray) – Resample target

  • indexer/kwargs (Dict[dim, grouping]) – Mapping of dimension name to grouping type

Returns

BlockedGroupby

class climtas.blocked.BlockedGroupby(da: xarray.core.dataarray.DataArray, grouping: str, dim: str = 'time')

A blocked groupby operation, created by blocked_groupby()

Works like xarray.core.groupby.DataArrayGroupBy, with the constraint that the data contains no partial years

The benefit of this restriction is that no extra Dask chunks are created by the grouping, which is important for large datasets.

apply(op: climtas.blocked.DataArrayFunction, **kwargs) → xarray.core.dataarray.DataArray

Apply a function to the blocked data

self.da is blocked to replace the self.dim dimension with two new dimensions, ‘year’ and self.grouping. op is then run on the data, and the result is converted back to the shape of self.da.

Use this to e.g. group the data by ‘dayofyear’, then rank each ‘dayofyear’ over the ‘year’ dimension

Parameters

  • op – Function applied to the blocked data; takes and returns an xarray.DataArray

  • **kwargs – Passed to op

Returns

xarray.DataArray shaped like self.da
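As a sketch of this block → apply → unblock pattern using plain NumPy (illustrative data only; rank_years is a hypothetical helper standing in for scipy.stats.rankdata, not part of climtas):

```python
import numpy as np

# Two non-leap years of daily data, blocked to ('year', 'dayofyear')
data = np.arange(2 * 365, dtype=float)   # stand-in time series
blocked = data.reshape(2, 365)

def rank_years(arr, axis=0):
    """Ordinal rank along `axis` (1 = smallest value)."""
    return np.argsort(np.argsort(arr, axis=axis), axis=axis) + 1

# Rank each 'dayofyear' over the 'year' dimension, then unblock
ranked = rank_years(blocked, axis=0)
unblocked = ranked.reshape(-1)           # back to the original time shape
```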

block_dataarray() → xarray.core.dataarray.DataArray

Reshape self.da to have a ‘year’ and a self.grouping axis

The self.dim axis is grouped up into individual years, then for each year that group’s self.dim is converted into self.grouping, so that leap years and non-leap years have the same length. The groups are then stacked together to create a new DataArray with ‘year’ as the first dimension and self.grouping replacing self.dim.

Data at a self.grouping value that only occurs in leap years (e.g. dayofyear 366) is NaN for non-leap years

Returns

The reshaped xarray.DataArray

See:

  • apply() will block the data, apply a function and then unblock the data

  • unblock_dataarray() will convert a DataArray shaped like this method’s output back into a DataArray shaped like self.da
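The leap-year padding can be illustrated with NumPy (hypothetical array lengths, not climtas internals):

```python
import numpy as np

# One leap year (366 days) followed by one non-leap year (365 days)
leap = np.arange(366, dtype=float)
nonleap = np.arange(365, dtype=float)

# Pad every year to the longest grouping length (366 for 'dayofyear'),
# filling the missing grouping with NaN, then stack on a new 'year' axis
blocked = np.full((2, 366), np.nan)
blocked[0, :] = leap
blocked[1, :365] = nonleap   # dayofyear 366 stays NaN for the non-leap year
```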

max() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.max()

See: reduce()

mean() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.mean()

See: reduce()

min() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.min()

See: reduce()

nanpercentile(q: float) → xarray.core.dataarray.DataArray

Reduce the samples using numpy.nanpercentile() over the ‘year’ axis

Slower than percentile(), but will be correct if there’s missing data (e.g. on leap days)

Parameters

q (float) – Percentile within the interval [0, 100]

See: reduce(), percentile()

percentile(q: float) → xarray.core.dataarray.DataArray

Reduce the samples using numpy.percentile() over the ‘year’ axis

Faster than nanpercentile(), but may be incorrect if there’s missing data (e.g. on leap days)

Parameters

q (float) – Percentile within the interval [0, 100]

See: reduce(), nanpercentile()
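The trade-off between percentile() and nanpercentile() comes directly from NumPy's handling of missing values, e.g.:

```python
import numpy as np

# Samples along the 'year' axis with one missing value,
# as happens at dayofyear 366 in a non-leap year
samples = np.array([1.0, 2.0, 3.0, np.nan])

exact = np.percentile(samples, 50)       # NaN propagates into the result
robust = np.nanpercentile(samples, 50)   # NaN is ignored
```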

rank(method: str = 'average') → xarray.core.dataarray.DataArray

Rank the samples using scipy.stats.rankdata() over the ‘year’ axis

Parameters

method – See scipy.stats.rankdata()

See: apply()

reduce(op: climtas.blocked.DataArrayFunction, **kwargs) → xarray.core.dataarray.DataArray

Reduce the data over ‘year’ using op

self.da is blocked to replace the self.dim dimension with two new dimensions, ‘year’ and self.grouping. op is then run on the data to remove the ‘year’ dimension

Note there will be NaN values in the data wherever a year has no sample for a self.grouping value (e.g. dayofyear = 366 or (month, day) = (2, 29) in a non-leap year)

Use this to e.g. group the data by ‘dayofyear’, then get the mean values at each ‘dayofyear’

Parameters

  • op – Function used to reduce out the 'year' dimension

  • **kwargs – Passed to op

Returns

xarray.DataArray shaped like self.da, but with self.dim replaced by self.grouping
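Ignoring the leap-year padding and Dask chunking, the effect of reduce() can be sketched with NumPy (illustrative data only):

```python
import numpy as np

# Three non-leap years of daily data, blocked to ('year', 'dayofyear')
data = np.random.default_rng(0).random(3 * 365)
blocked = data.reshape(3, 365)

# Reduce over the 'year' axis; numpy.nanmean tolerates the NaN padding
doy_mean = np.nanmean(blocked, axis=0)   # one value per dayofyear
```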

sum() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.sum()

See: reduce()

unblock_dataarray(da: xarray.core.dataarray.DataArray) → xarray.core.dataarray.DataArray

Inverse of block_dataarray()

Given a DataArray constructed by block_dataarray(), returns an ungrouped DataArray with the original self.dim axis from self.da.

Data at a self.grouping value that only occurs in leap years (e.g. dayofyear 366) is dropped for non-leap years

percentile

climtas.blocked.approx_percentile(da: Union[xarray.core.dataarray.DataArray, dask.array.core.Array, numpy.ndarray], q, dim: Optional[str] = None, axis: Optional[int] = None, skipna: bool = True)

Return an approximation of the qth percentile along a dimension of da

For large Dask datasets the approximation will compute much faster than numpy.percentile()

If da contains Dask data, it will use Dask's approximate percentile algorithm extended to multiple dimensions; see dask.array.percentile()

If da contains Numpy data it will use numpy.percentile()

Parameters
  • da – Input dataset

  • q – Percentile to calculate in the range [0,100]

  • dim – Dimension name to reduce (xarray data only)

  • axis – Axis number to reduce

  • skipna – Ignore NaN values (like numpy.nanpercentile())

Returns

An array of the same type as da; otherwise behaves like numpy.percentile()
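The idea behind the approximation can be sketched with NumPy (a deliberate simplification, not Dask's actual merge algorithm): compute the percentile of each chunk independently, then combine the per-chunk estimates instead of sorting the full dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.random(10_000) * 100          # uniform on [0, 100)
chunks = np.split(data, 10)              # stand-ins for Dask chunks

exact = np.percentile(data, 90)

# Simplified approximation: average the per-chunk percentile estimates
approx = np.mean([np.percentile(c, 90) for c in chunks])
```

Each chunk is only ever reduced locally, which is what lets the Dask version avoid gathering the whole array onto one worker.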

climtas.blocked.dask_approx_percentile(array: dask.array.routines.array, pcts, axis: int, interpolation='linear', skipna=True)

Get the approximate percentiles of a Dask array along ‘axis’, using the ‘dask’ method of dask.array.percentile().

Parameters
  • array – Dask Nd array

  • pcts – List of percentiles to calculate, within the interval [0,100]

  • axis – Axis to reduce

  • skipna – Ignore NaN values (like numpy.nanpercentile()) if true

Returns

Dask array with first axis the percentiles from ‘pcts’, remaining axes from ‘array’ reduced along ‘axis’

resample

climtas.blocked.blocked_resample(da: xarray.core.dataarray.DataArray, indexer=None, **kwargs) → climtas.blocked.BlockedResampler

Create a blocked resampler

Mostly works like xarray.DataArray.resample(), however unlike Xarray's resample this maintains the same number of Dask chunks

The input data is grouped into blocks of length count along dim for further operations (see BlockedResampler)

count must evenly divide the size of each Dask chunk along the target dimension

>>> time = pandas.date_range('20010101', '20010110', freq='H', inclusive='left')
>>> hourly = xarray.DataArray(numpy.random.random(time.size), coords=[('time', time)])
>>> blocked_daily_max = blocked_resample(hourly, time='1D').max()
>>> xarray_daily_max = hourly.resample(time='1D').max()
>>> xarray.testing.assert_identical(blocked_daily_max, xarray_daily_max)
>>> blocked_daily_max = blocked_resample(hourly, time=24).max()
>>> xarray_daily_max = hourly.resample(time='1D').max()
>>> xarray.testing.assert_identical(blocked_daily_max, xarray_daily_max)
Parameters
  • da (xarray.DataArray) – Resample target

  • indexer/kwargs (Dict[dim, count]) – Mapping of dimension name to count along that axis. May be an integer or a time interval understood by pandas (that interval must evenly divide the dataset).

Returns

BlockedResampler
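Within each Dask block the resampling is just a reshape plus a reduction along the new axis, which is why no extra chunks are created. With plain NumPy (illustrative data only):

```python
import numpy as np

hours = np.arange(10 * 24, dtype=float)        # 10 days of hourly samples
daily_max = hours.reshape(-1, 24).max(axis=1)  # one value per day
```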

class climtas.blocked.BlockedResampler(da: xarray.core.dataarray.DataArray, dim: str, count: int)

A blocked resampling operation, created by blocked_resample()

Works like xarray.core.resample.DataArrayResample, with the constraint that the resampling interval is regular and evenly divides the length along dim of every Dask chunk in da.

The benefit of this restriction is that no extra Dask chunks are created by the resampling, which is important for large datasets.

max() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.max()

mean() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.mean()

min() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.min()

nanmax() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.nanmax()

nanmin() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.nanmin()

reduce(op: Callable, **kwargs) → xarray.core.dataarray.DataArray

Apply an arbitrary operation to each resampled group

The function op is applied to each group. op is passed the grouping axis via the axis keyword argument and should reduce that axis out (e.g. as numpy.mean() does)

Parameters
  • op ((numpy.array, axis, **kwargs) -> numpy.array) – Function to reduce out the resampled dimension

  • **kwargs – Passed to op

Returns

A resampled xarray.DataArray, where every self.count values along self.dim have been reduced by op
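Any function matching the (array, axis, **kwargs) contract can be used. For example, a per-day range (daily_range is a hypothetical op, sketched here with NumPy):

```python
import numpy as np

def daily_range(arr, axis=None):
    """Reduce `axis` to the spread between max and min (like numpy.ptp)."""
    return arr.max(axis=axis) - arr.min(axis=axis)

hours = np.arange(48, dtype=float).reshape(2, 24)  # 2 days x 24 hours
spread = daily_range(hours, axis=1)                # per-day range
```

With climtas this would be used as blocked_resample(da, time=24).reduce(daily_range).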

sum() → xarray.core.dataarray.DataArray

Reduce the samples using numpy.sum()