climtas.profile

Profiling dask data processing

benchmark() runs a function with several chunk settings, returning the
time taken for each setting.
profile() runs a function with a single chunk setting, returning the
time taken in each Dask stage along with chunk information.

Profile results

time_total
Total time taken to process the data (seconds)

time_open
Time spent opening the dataset (seconds)

time_function
Time spent running the function (seconds)

time_optimize
Time spent optimizing the Dask graph (seconds)

time_load
Time spent computing the data with Dask (seconds)

chunks
Chunk shape

nchunks_in
Number of chunks in loaded data

nchunks_out
Number of chunks in function output

chunksize_in
Size of chunks in loaded data

chunksize_out
Size of chunks in function output

tasks_in
Dask graph size in loaded data

tasks_out
Dask graph size in function output

tasks_optimized
Dask graph size after optimizing function output
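The chunk fields can be checked by hand. The sketch below reproduces the nchunks_in and chunksize_in values from the sample output shown for profile(), under assumptions that are not stated in this page: the input is taken to be one month of hourly data on a 721 x 1440 grid (shape 744 x 721 x 1440) stored as 4-byte floats, and sizes are reported in decimal megabytes.

```python
import math

# Assumed (hypothetical) input shape: one month of hourly data on a 721 x 1440 grid
shape = {"time": 744, "latitude": 721, "longitude": 1440}
# Chunk shape, as passed in the `chunks` argument
chunks = {"time": 93, "latitude": 91, "longitude": 180}
itemsize = 4  # bytes per value, assuming float32

# Chunks along each dimension; the last chunk may be partial, so round up
chunks_per_dim = {dim: math.ceil(shape[dim] / chunks[dim]) for dim in shape}
nchunks_in = math.prod(chunks_per_dim.values())

# Bytes held by one full chunk
chunk_bytes = math.prod(chunks.values()) * itemsize

print(chunks_per_dim)                 # {'time': 8, 'latitude': 8, 'longitude': 8}
print(nchunks_in)                     # 512
print(f"{chunk_bytes / 1e6:.2f} MB")  # 6.09 MB
```

Under the same float32 assumption, a scalar result such as a global mean occupies 4 bytes, which agrees with a chunksize_out of '4 B'.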

climtas.profile.benchmark(paths: str, variable: str, chunks: Dict[str, List[int]], function, run_count: int = 3, mfdataset_args: Dict[str, Any] = {})

Profile a function on different chunks of data

For each chunk option, opens the dataset with xarray.open_mfdataset() using those chunks, then runs function on variable

>>> def func(da):
...     return da.mean()
>>> climtas.profile.benchmark(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':[93, 93], 'latitude': [91, 91], 'longitude': [180, 180*2]}) 

Parameters

paths – Paths to open (as xarray.open_mfdataset())
variable – Variable in the dataset to use
chunks – Mapping of dimension name to a list of chunk sizes, one entry for each run
function – Function to benchmark; it takes an xarray.DataArray (the variable) and returns an xarray.DataArray
run_count – Number of times to run each profile (the minimum time is returned)
mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()

Returns

pandas.DataFrame with information from profile() for each run
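One way to read "one entry for each run" is that the i-th entry of every list is combined into the chunk mapping for run i. The sketch below illustrates that reading only; per_run_chunks is a hypothetical helper, not part of climtas, and the positional pairing is an interpretation of the parameter description rather than something confirmed from the library source.

```python
from typing import Dict, List


def per_run_chunks(chunks: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Hypothetical helper: pair up the i-th entry of each dimension's list."""
    lengths = {len(sizes) for sizes in chunks.values()}
    if len(lengths) > 1:
        raise ValueError("every dimension needs the same number of chunk sizes")
    # zip(*values) walks the lists in step, giving one tuple of sizes per run
    return [dict(zip(chunks, sizes)) for sizes in zip(*chunks.values())]


runs = per_run_chunks(
    {"time": [93, 93], "latitude": [91, 91], "longitude": [180, 180 * 2]}
)
for run in runs:
    print(run)
# {'time': 93, 'latitude': 91, 'longitude': 180}
# {'time': 93, 'latitude': 91, 'longitude': 360}
```

Under this reading, the example call above would compare two settings that differ only in the longitude chunk size.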

climtas.profile.profile(paths: str, variable: str, chunks: Dict[str, int], function, run_count: int = 3, mfdataset_args: Dict[str, Any] = {})

Run a function run_count times, returning the minimum time taken

>>> def func(da):
...     return da.mean()
>>> climtas.profile.profile(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':93, 'latitude': 91, 'longitude': 180}) 
{'time_total': 9.561158710159361,
 'time_open': 0.014718276914209127,
 'time_function': 0.0033595040440559387,
 'time_optimize': 0.01087462529540062,
 'time_load': 9.529402975924313,
 'chunks': {'time': 93, 'latitude': 91, 'longitude': 180},
 'nchunks_in': 512,
 'nchunks_out': 1,
 'chunksize_in': '6.09 MB',
 'chunksize_out': '4 B',
 'tasks_in': 513,
 'tasks_out': 1098,
 'tasks_optimized': 1098}

Parameters

paths – Paths to open (as xarray.open_mfdataset())
variable – Variable in the dataset to use
chunks – Mapping of dimension name to chunk sizes
function – Function to profile; it takes an xarray.DataArray (the variable) and returns a dask.array.Array
run_count – Number of times to run each profile (the minimum time is returned)
mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()

Returns

Dict[str, Any] profiling information
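Reporting the minimum over several runs is a common way to reduce the influence of unrelated system load on a timing. The sketch below shows the general best-of-N pattern with the standard library only; best_of is a hypothetical helper and is not how climtas is necessarily implemented.

```python
import time
from typing import Callable


def best_of(function: Callable[[], object], run_count: int = 3) -> float:
    """Hypothetical helper: minimum wall-clock seconds over run_count calls."""
    if run_count < 1:
        raise ValueError("run_count must be at least 1")
    timings = []
    for _ in range(run_count):
        start = time.perf_counter()
        function()
        timings.append(time.perf_counter() - start)
    return min(timings)


calls = []  # records one entry per call so the run count can be checked
elapsed = best_of(lambda: calls.append(sum(range(100_000))), run_count=3)
print(f"best of {len(calls)} runs: {elapsed:.6f} s")
```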

climtas.profile.profile_once(paths: str, variable: str, chunks: Dict[str, int], function, mfdataset_args: Dict[str, Any] = {})

Run a single profile instance

>>> def func(da):
...     return da.mean()
>>> climtas.profile.profile_once(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time':93, 'latitude': 91, 'longitude': 180}) 
{'time_total': 9.561158710159361,
 'time_open': 0.014718276914209127,
 'time_function': 0.0033595040440559387,
 'time_optimize': 0.01087462529540062,
 'time_load': 9.529402975924313,
 'chunks': {'time': 93, 'latitude': 91, 'longitude': 180},
 'nchunks_in': 512,
 'nchunks_out': 1,
 'chunksize_in': '6.09 MB',
 'chunksize_out': '4 B',
 'tasks_in': 513,
 'tasks_out': 1098,
 'tasks_optimized': 1098}

Parameters

paths – Paths to open (as xarray.open_mfdataset())
variable – Variable in the dataset to use
chunks – Mapping of dimension name to chunk sizes
function – Function to profile; it takes an xarray.DataArray (the variable) and returns a dask.array.Array
mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()

Returns

Dict[str, Any] profiling information
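The time_* entries describe separate stages of one run, with time_total covering all of them. A minimal sketch of collecting per-stage timings into a single dictionary follows; stage is a hypothetical helper, and the plain-Python statements inside each block are stand-ins for the real open, function and load steps, not climtas or Dask calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage(results: dict, name: str):
    """Hypothetical helper: store the elapsed seconds of a block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = time.perf_counter() - start


results = {}
with stage(results, "time_total"):
    with stage(results, "time_open"):
        data = list(range(100_000))   # stand-in for opening the dataset
    with stage(results, "time_function"):
        lazy = (x * 2 for x in data)  # stand-in for building a lazy result
    with stage(results, "time_load"):
        total = sum(lazy)             # stand-in for computing the result

print(sorted(results))
# ['time_function', 'time_load', 'time_open', 'time_total']
```

Because the outer timer also covers the gaps between stages, time_total comes out slightly larger than the sum of the inner timings, as it does in the sample output above.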

climtas 0.3.2+9.g68ddf31.dirty documentation
