Functions for reading and saving data

These functions try to use sensible chunking, both for the dask objects they read and for the netCDF files they write.

Union[xarray.core.dataarray.DataArray, xarray.core.dataset.Dataset], path: Union[str, pathlib.Path], groupby: str, complevel: int = 4)

Split a dataset into multiple parts, and save each part into its own file.

path should be a str.format()-compatible string. It is formatted with three arguments: start and end, which are pandas.Timestamp objects, and group, which is the name of the current group being output (e.g. the year when using groupby='time.year'). These can be used to name the file, e.g.:

path_a = 'data_{group}.nc'
path_b = 'data_{start.month}_{end.month}.nc'
path_c = 'data_{start.year:04d}{start.month:02d}{start.day:02d}.nc'

Note that start and end are the first and last timestamps of the group's data, which may not match the group's nominal start and end dates.
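As a rough illustration of how such a path template gets filled in, the sketch below groups a daily pandas series by year and formats one file name per group. It assumes the three values are passed to str.format() as keyword arguments named start, end and group. The series, the template and the names list are hypothetical stand-ins, not part of the library.

```python
import pandas as pd

# Hypothetical daily data that starts and ends mid-year
times = pd.date_range("2000-03-15", "2001-02-10", freq="D")
series = pd.Series(range(len(times)), index=times)

# Hypothetical template using all three arguments
template = "data_{group}_{start.month:02d}_{end.month:02d}.nc"

names = []
for group, part in series.groupby(series.index.year):
    # start and end come from the data in the group, not the calendar year
    start, end = part.index[0], part.index[-1]
    names.append(template.format(start=start, end=end, group=group))

print(names)  # ['data_2000_03_12.nc', 'data_2001_01_02.nc']
```

The file for 2000 is labelled with month 03 rather than 01, because the first timestamp in that group is 15 March.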

Union[xarray.core.dataarray.DataArray, xarray.core.dataset.Dataset], path: Union[str, pathlib.Path], complevel: int = 4, max_tasks: Optional[int] = None, show_progress: bool = True)

Save a Dataset or DataArray to file by calculating each chunk separately (rather than submitting the whole Dask graph at once). This may be helpful when chunks are large, e.g. when operating on a dayofyear grouping of a long timeseries.

Chunks are calculated with at most max_tasks chunks running in parallel. This defaults to the number of workers in your dask.distributed.Client, or to 1 if distributed is not being used.

This is a very basic way to handle backpressure, where data is coming in faster than it can be processed and so fills up memory. Ideally this will be fixed in Dask itself.

In particular, it will only work well if the chunks in the dataset are independent (e.g. if doing operations over a timeseries for a single horizontal chunk so the horizontal chunks are isolated).
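The throttling idea can be sketched with the standard library alone. This is not the library's implementation: process_throttled, compute and save are hypothetical names, and a thread pool stands in for Dask workers. New chunks are only submitted once fewer than max_tasks are outstanding, so unsaved results cannot pile up in memory.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def process_throttled(chunks, compute, save, max_tasks=1):
    """Compute each chunk independently, with at most max_tasks in flight."""

    def run(index, chunk):
        return index, compute(chunk)

    pending = set()
    with ThreadPoolExecutor(max_workers=max_tasks) as pool:
        for index, chunk in enumerate(chunks):
            if len(pending) >= max_tasks:
                # Block until at least one chunk finishes, then save it
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    save(*future.result())
            pending.add(pool.submit(run, index, chunk))
        # Save whatever is still outstanding
        for future in pending:
            save(*future.result())


results = {}
process_throttled([[1, 2], [3, 4], [5, 6]], sum, results.__setitem__, max_tasks=2)
print(results == {0: 3, 1: 7, 2: 11})  # True
```

This only behaves well when each chunk can be computed without the others, which is the same independence condition described above.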

Parameters

  • da (xarray.Dataset or xarray.DataArray) – Data to save

  • path (str or pathlib.Path) – Path to save to

  • complevel (int) – NetCDF compression level

  • max_tasks (int) – Maximum tasks to run at once (default number of distributed workers)

  • show_progress (bool) – Show a progress bar with estimated completion time