
Profiling dask data processing

  • benchmark() runs a function with different chunks, returning the time
    taken for each chunk setting

  • profile() runs a function with a single chunk setting, returning the
    time taken in different Dask stages and chunk information

Profile results

  • time_total – Total time taken to process the data (seconds)

  • time_open – Time spent opening the dataset (seconds)

  • time_function – Time spent running the function (seconds)

  • time_optimize – Time spent optimizing the Dask graph (seconds)

  • time_load – Time spent computing the data with Dask (seconds)

  • chunks – Chunk shape

  • nchunks_in – Number of chunks in loaded data

  • nchunks_out – Number of chunks in function output

  • chunksize_in – Size of chunks in loaded data

  • chunksize_out – Size of chunks in function output

  • tasks_in – Dask graph size in loaded data

  • tasks_out – Dask graph size in function output

  • tasks_optimized – Dask graph size after optimizing function output
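
As an illustration of how these fields can be read, the sketch below takes the result dictionary from the profile() example on this page and works out what share of the total time each timed stage accounts for. The helper name stage_fractions is hypothetical and is not part of climtas.

```python
# Result dictionary copied from the profile() example in this document.
result = {
    "time_total": 9.561158710159361,
    "time_open": 0.014718276914209127,
    "time_function": 0.0033595040440559387,
    "time_optimize": 0.01087462529540062,
    "time_load": 9.529402975924313,
    "chunks": {"time": 93, "latitude": 91, "longitude": 180},
    "nchunks_in": 512,
    "nchunks_out": 1,
    "chunksize_in": "6.09 MB",
    "chunksize_out": "4 B",
    "tasks_in": 513,
    "tasks_out": 1098,
    "tasks_optimized": 1098,
}

# The four timed stages reported alongside time_total.
STAGES = ("time_open", "time_function", "time_optimize", "time_load")


def stage_fractions(result):
    """Return each stage's share of time_total.

    Hypothetical helper for illustration; not a climtas function.
    """
    total = result["time_total"]
    return {stage: result[stage] / total for stage in STAGES}


fractions = stage_fractions(result)
slowest = max(fractions, key=fractions.get)
print(slowest, f"{fractions[slowest]:.1%}")
```

For this example nearly all of the time is spent in time_load, so a different chunk layout is more likely to help than a change to the function itself.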

climtas.profile.benchmark(paths: str, variable: str, chunks: Dict[str, List[int]], function, run_count: int = 3, mfdataset_args: Dict[str, Any] = {})

Profile a function on different chunks of data

For each chunk setting, opens the dataset with xarray.open_mfdataset() using those chunks, then runs function on variable.

>>> def func(da):
...     return da.mean()
>>> climtas.profile.benchmark(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':[93, 93], 'latitude': [91, 91], 'longitude': [180, 180*2]}) 
  • paths – Paths to open (as xarray.open_mfdataset())

  • variable – Variable in the dataset to use

  • chunks – Mapping of dimension name to a list of chunk sizes, one entry for each run

  • function – Function to profile; it takes an xarray.DataArray (the variable) and returns an xarray.DataArray

  • run_count – Number of times to run each profile (the minimum time is returned)

  • mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()


Returns a pandas.DataFrame with the information from profile() for each run.
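
The "one entry for each run" layout of chunks can be read as follows: the first entry of every list forms the first chunk setting, the second entries form the second, and so on. The sketch below illustrates that reading with a hypothetical helper, expand_chunks. It is an assumption about how the lists are paired, not a copy of the climtas implementation.

```python
from typing import Dict, List


def expand_chunks(chunks: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Pair the i-th entry of every list into the i-th per-run chunk mapping.

    Hypothetical helper for illustration; not a climtas function.
    """
    lengths = {len(sizes) for sizes in chunks.values()}
    if len(lengths) > 1:
        raise ValueError("every dimension needs the same number of chunk sizes")
    # zip(*lists) walks the lists in lockstep, one tuple of sizes per run.
    return [dict(zip(chunks, sizes)) for sizes in zip(*chunks.values())]


# The chunks argument from the benchmark() example in this document.
runs = expand_chunks(
    {"time": [93, 93], "latitude": [91, 91], "longitude": [180, 180 * 2]}
)
for run in runs:
    print(run)
```

Under this reading, each resulting mapping has the same shape as the chunks argument accepted by profile().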

climtas.profile.profile(paths: str, variable: str, chunks: Dict[str, int], function, run_count: int = 3, mfdataset_args: Dict[str, Any] = {})

Run a function run_count times, returning the minimum time taken

>>> def func(da):
...     return da.mean()
>>> climtas.profile.profile(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':93, 'latitude': 91, 'longitude': 180}) 
{'time_total': 9.561158710159361,
 'time_open': 0.014718276914209127,
 'time_function': 0.0033595040440559387,
 'time_optimize': 0.01087462529540062,
 'time_load': 9.529402975924313,
 'chunks': {'time': 93, 'latitude': 91, 'longitude': 180},
 'nchunks_in': 512,
 'nchunks_out': 1,
 'chunksize_in': '6.09 MB',
 'chunksize_out': '4 B',
 'tasks_in': 513,
 'tasks_out': 1098,
 'tasks_optimized': 1098}
  • paths – Paths to open (as xarray.open_mfdataset())

  • variable – Variable in the dataset to use

  • chunks – Mapping of dimension name to chunk sizes

  • function – Function to profile; it takes an xarray.DataArray (the variable) and returns a dask.array.Array

  • run_count – Number of times to run each profile (the minimum time is returned)

  • mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()


Returns a Dict[str, Any] of profiling information (see Profile results).
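
Taking the minimum over several runs is a common way to reduce noise from caching and other system activity. A minimal sketch of the idea, using only the standard library, is shown below. The names best_of and work are hypothetical, and the way climtas combines runs internally may differ (for example, it could take the minimum per stage rather than per run).

```python
import time
from typing import Any, Callable, Dict


def best_of(run: Callable[[], Dict[str, Any]], run_count: int = 3) -> Dict[str, Any]:
    """Call run() run_count times and keep the result with the smallest time_total.

    Hypothetical helper for illustration; not a climtas function.
    """
    if run_count < 1:
        raise ValueError("run_count must be at least 1")
    results = [run() for _ in range(run_count)]
    return min(results, key=lambda r: r["time_total"])


def work() -> Dict[str, Any]:
    """Stand-in for a single profiling run: time a small computation."""
    start = time.perf_counter()
    sum(range(100_000))
    return {"time_total": time.perf_counter() - start}


best = best_of(work, run_count=3)
print(f"fastest run: {best['time_total']:.6f} s")
```

The first run of a real profile often includes one-off costs such as filesystem caching, which is one reason the minimum can be more representative than the mean.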

climtas.profile.profile_once(paths: str, variable: str, chunks: Dict[str, int], function, mfdataset_args: Dict[str, Any] = {})

Run a single profile instance

>>> def func(da):
...     return da.mean()
>>> climtas.profile.profile_once(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':93, 'latitude': 91, 'longitude': 180}) 
{'time_total': 9.561158710159361,
 'time_open': 0.014718276914209127,
 'time_function': 0.0033595040440559387,
 'time_optimize': 0.01087462529540062,
 'time_load': 9.529402975924313,
 'chunks': {'time': 93, 'latitude': 91, 'longitude': 180},
 'nchunks_in': 512,
 'nchunks_out': 1,
 'chunksize_in': '6.09 MB',
 'chunksize_out': '4 B',
 'tasks_in': 513,
 'tasks_out': 1098,
 'tasks_optimized': 1098}
  • paths – Paths to open (as xarray.open_mfdataset())

  • variable – Variable in the dataset to use

  • chunks – Mapping of dimension name to chunk sizes

  • function – Function to profile; it takes an xarray.DataArray (the variable) and returns a dask.array.Array

  • mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()


Returns a Dict[str, Any] of profiling information (see Profile results).