climtas.io
Functions for reading and saving data
These functions try to use sensible chunking, both for the Dask objects read and the NetCDF files written.
- climtas.io.to_netcdf_series(ds: Union[xarray.core.dataarray.DataArray, xarray.core.dataset.Dataset], path: Union[str, pathlib.Path], groupby: str, complevel: int = 4)
Split a dataset into multiple parts, and save each part into its own file.
path should be a str.format()-compatible string. It is formatted with three arguments: start and end, which are pandas.Timestamp objects, and group, which is the name of the current group being output (e.g. the year when using groupby='time.year'). These can be used to name the file, e.g.:

    path_a = 'data_{group}.nc'
    path_b = 'data_{start.month}_{end.month}.nc'
    path_c = 'data_{start.year:04d}{start.month:02d}{start.day:02d}.nc'

Note that start and end are the first and last timestamps of the group's data, which may not match the boundary start and end dates.
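As a concrete illustration of the template formatting (the start, end, and group values below are made up; climtas supplies the real group boundaries when saving):

```python
# Hypothetical values standing in for what climtas passes to str.format():
# start and end are pandas.Timestamp, group is the groupby key (here a year).
import pandas as pd

start = pd.Timestamp("2001-01-01")
end = pd.Timestamp("2001-12-31")
group = 2001  # e.g. the year when groupby='time.year'

path_a = 'data_{group}.nc'.format(group=group, start=start, end=end)
path_b = 'data_{start.month}_{end.month}.nc'.format(group=group, start=start, end=end)
path_c = 'data_{start.year:04d}{start.month:02d}{start.day:02d}.nc'.format(
    group=group, start=start, end=end)

print(path_a)  # data_2001.nc
print(path_b)  # data_1_12.nc
print(path_c)  # data_20010101.nc
```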
- Parameters
    - ds (xarray.Dataset or xarray.DataArray) – Data to save
    - path (str or pathlib.Path) – Path template to save to
    - groupby (str) – Grouping, as used by xarray.DataArray.groupby()
    - complevel (int) – NetCDF compression level
- climtas.io.to_netcdf_throttled(ds: Union[xarray.core.dataarray.DataArray, xarray.core.dataset.Dataset], path: Union[str, pathlib.Path], complevel: int = 4, max_tasks: Optional[int] = None, show_progress: bool = True)
Save a DataArray to file by calculating each chunk separately (rather than submitting the whole Dask graph at once). This may be helpful when chunks are large, e.g. doing an operation on dayofyear grouping for a long timeseries.
Chunks are calculated with at most max_tasks chunks running in parallel; this defaults to the number of workers in your dask.distributed.Client, or 1 if distributed is not being used.
This is a very basic way to handle backpressure, where data is coming in faster than it can be processed and so fills up memory. Ideally this will be fixed in Dask itself, see e.g. https://github.com/dask/distributed/issues/2602
In particular, it will only work well if the chunks in the dataset are independent (e.g. if doing operations over a timeseries for a single horizontal chunk so the horizontal chunks are isolated).
- Parameters
    - ds (xarray.Dataset or xarray.DataArray) – Data to save
    - path (str or pathlib.Path) – Path to save to
    - complevel (int) – NetCDF compression level
    - max_tasks (int) – Maximum number of tasks to run at once (defaults to the number of distributed workers)
    - show_progress (bool) – Show a progress bar with estimated completion time
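The throttling idea can be sketched with a standard-library thread pool: with at most max_tasks computations in flight, each chunk is written out before the whole graph is materialised, keeping memory bounded. This is a conceptual illustration only, with stand-in compute and write functions; it is not the climtas implementation.

```python
# Conceptual sketch of throttled chunk computation: a pool of max_tasks
# workers means at most max_tasks chunks are being computed at once.
# compute_chunk and write_chunk are stand-ins, not climtas internals.
from concurrent.futures import ThreadPoolExecutor

max_tasks = 2  # e.g. the number of distributed workers

def compute_chunk(i):
    return [i] * 3  # stand-in for computing one Dask chunk

written = []
def write_chunk(i, data):
    written.append((i, data))  # stand-in for writing the chunk to NetCDF

with ThreadPoolExecutor(max_workers=max_tasks) as pool:
    # pool.map yields results in order; the writer drains them as they arrive
    for i, data in zip(range(5), pool.map(compute_chunk, range(5))):
        write_chunk(i, data)

print(len(written))  # 5
```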