climtas.profile

Profiling dask data processing

  • benchmark() runs a function with different chunks, returning the time taken for each chunk setting

  • profile() runs a function with a single chunk setting, returning the time taken in different dask stages and chunk information

Profile results

time_total

Total time taken to process the data (seconds)

time_open

Time spent opening the dataset (seconds)

time_function

Time spent running the function (seconds)

time_optimize

Time spent optimizing the Dask graph (seconds)

time_load

Time spent computing the data with Dask (seconds)

chunks

Chunk shape

nchunks_in

Number of chunks in loaded data

nchunks_out

Number of chunks in function output

chunksize_in

Size of chunks in loaded data

chunksize_out

Size of chunks in function output

tasks_in

Dask graph size in loaded data

tasks_out

Dask graph size in function output

tasks_optimized

Dask graph size after optimizing function output
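
As a rough illustration of reading these fields, the sketch below takes a result dictionary (values copied from the profile() example output on this page) and works out what share of time_total each stage accounts for. The helper name stage_fractions is hypothetical and is not part of climtas.

```python
# Values copied from the example output of profile() on this page
result = {
    "time_total": 9.561158710159361,
    "time_open": 0.014718276914209127,
    "time_function": 0.0033595040440559387,
    "time_optimize": 0.01087462529540062,
    "time_load": 9.529402975924313,
    "nchunks_in": 512,
    "tasks_in": 513,
    "tasks_out": 1098,
    "tasks_optimized": 1098,
}

# The four per-stage timings listed under "Profile results"
STAGES = ("time_open", "time_function", "time_optimize", "time_load")


def stage_fractions(result):
    """Return each stage's share of time_total (hypothetical helper)."""
    total = result["time_total"]
    return {stage: result[stage] / total for stage in STAGES}


fractions = stage_fractions(result)
slowest = max(fractions, key=fractions.get)
print(f"slowest stage: {slowest} ({fractions[slowest]:.1%} of total)")
```

For this example the load stage dominates, which suggests that computing the data, rather than opening the files or building the graph, is where a different chunking is most likely to make a difference.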

climtas.profile.benchmark(paths: str, variable: str, chunks: Dict[str, List[int]], function, run_count: int = 3, mfdataset_args: Dict[str, Any] = {})

Profile a function on different chunks of data

For each chunk option, opens the dataset with xarray.open_mfdataset() using those chunks, then runs function on variable

>>> def func(da):
...     return da.mean()
>>> climtas.profile.benchmark(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':[93, 93], 'latitude': [91, 91], 'longitude': [180, 180*2]}) 
Parameters
  • paths – Paths to open (as xarray.open_mfdataset())

  • variable – Variable in the dataset to use

  • chunks – Mapping of dimension name to a list of chunk sizes, one entry for each run

  • function – Function that takes a xarray.DataArray (the variable) and returns a xarray.DataArray to test the performance of

  • run_count – Number of times to run each profile (the minimum time is returned)

  • mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()

Returns

pandas.DataFrame with information from profile() for each run
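
The chunks argument holds a list per dimension, "one entry for each run". One plausible reading, which the sketch below assumes, is that the i-th entry of every list is combined into the chunk setting for the i-th run; the library may combine them differently. The helper chunk_settings is hypothetical and only illustrates that reading.

```python
from typing import Dict, List


def chunk_settings(chunks: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Hypothetical helper: split a dict of lists into one dict per run.

    Assumes the i-th entry of every list belongs to the i-th run.
    """
    lengths = {len(sizes) for sizes in chunks.values()}
    if len(lengths) != 1:
        raise ValueError("every dimension needs the same number of chunk sizes")
    n_runs = lengths.pop()
    return [{dim: sizes[i] for dim, sizes in chunks.items()} for i in range(n_runs)]


# Same chunks argument as the benchmark() example above
settings = chunk_settings(
    {"time": [93, 93], "latitude": [91, 91], "longitude": [180, 180 * 2]}
)
for setting in settings:
    print(setting)
```

Under this assumption each resulting dictionary has the shape that profile() expects for its own chunks argument.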

climtas.profile.profile(paths: str, variable: str, chunks: Dict[str, int], function, run_count: int = 3, mfdataset_args: Dict[str, Any] = {})

Run a function run_count times, returning the minimum time taken

>>> def func(da):
...     return da.mean()
>>> climtas.profile.profile(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':93, 'latitude': 91, 'longitude': 180}) 
{'time_total': 9.561158710159361,
 'time_open': 0.014718276914209127,
 'time_function': 0.0033595040440559387,
 'time_optimize': 0.01087462529540062,
 'time_load': 9.529402975924313,
 'chunks': {'time': 93, 'latitude': 91, 'longitude': 180},
 'nchunks_in': 512,
 'nchunks_out': 1,
 'chunksize_in': '6.09 MB',
 'chunksize_out': '4 B',
 'tasks_in': 513,
 'tasks_out': 1098,
 'tasks_optimized': 1098}
Parameters
  • paths – Paths to open (as xarray.open_mfdataset())

  • variable – Variable in the dataset to use

  • chunks – Mapping of dimension name to chunk sizes

  • function – Function that takes a xarray.DataArray (the variable) and returns a dask.array.Array to test the performance of

  • run_count – Number of times to run each profile (the minimum time is returned)

  • mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()

Returns

Dict[str, Any] profiling information
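
The "run run_count times, keep the minimum" approach can be sketched with the standard library alone. The minimum is a common choice for timings because it is the run least disturbed by other activity on the machine. The helper min_time below is hypothetical; how profile() itself selects the minimum across its result fields is not stated here.

```python
import time


def min_time(function, run_count=3):
    """Hypothetical helper: run function run_count times, return the fastest time."""
    timings = []
    for _ in range(run_count):
        start = time.perf_counter()
        function()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Time a small stand-in workload three times and keep the best run
best = min_time(lambda: sum(range(100_000)), run_count=3)
print(f"best of 3: {best:.6f} s")
```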

climtas.profile.profile_once(paths: str, variable: str, chunks: Dict[str, int], function, mfdataset_args: Dict[str, Any] = {})

Run a single profile instance

>>> def func(da):
...     return da.mean()
>>> climtas.profile.profile_once(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':93, 'latitude': 91, 'longitude': 180}) 
{'time_total': 9.561158710159361,
 'time_open': 0.014718276914209127,
 'time_function': 0.0033595040440559387,
 'time_optimize': 0.01087462529540062,
 'time_load': 9.529402975924313,
 'chunks': {'time': 93, 'latitude': 91, 'longitude': 180},
 'nchunks_in': 512,
 'nchunks_out': 1,
 'chunksize_in': '6.09 MB',
 'chunksize_out': '4 B',
 'tasks_in': 513,
 'tasks_out': 1098,
 'tasks_optimized': 1098}
Parameters
  • paths – Paths to open (as xarray.open_mfdataset())

  • variable – Variable in the dataset to use

  • chunks – Mapping of dimension name to chunk sizes

  • function – Function that takes a xarray.DataArray (the variable) and returns a dask.array.Array to test the performance of

  • mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()

Returns

Dict[str, Any] profiling information
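
The per-stage timings in the result can be pictured as nested timers: one around the whole run and one around each stage. The sketch below shows that pattern with the standard library only. Timer and timed_stages are hypothetical names, and the stand-in stages replace the real work (opening the dataset, applying the function, computing with Dask), so this illustrates the idea rather than the library's implementation.

```python
import time


class Timer:
    """Hypothetical context manager that records elapsed wall-clock seconds."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        self.elapsed = time.perf_counter() - self.start
        return False  # do not swallow exceptions


def timed_stages(stages):
    """Run (name, callable) pairs in order, passing each result to the next."""
    result = {}
    value = None
    with Timer() as total:
        for name, stage in stages:
            with Timer() as t:
                value = stage(value)
            result[f"time_{name}"] = t.elapsed
    result["time_total"] = total.elapsed
    return result, value


# Stand-in stages; the real ones would open the dataset, apply the
# function lazily, and compute the result.
stages = [
    ("open", lambda _: list(range(1000))),
    ("function", lambda data: (x * 2 for x in data)),
    ("load", lambda lazy: sum(lazy)),
]
timings, value = timed_stages(stages)
print(timings)
```

As in the example output above, the lazy "function" stage is cheap and the cost is paid when the result is actually computed in the "load" stage.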