climtas.profile
Profiling Dask data processing
benchmark()
runs a function with different chunk settings, returning the time taken for each chunk setting
profile()
runs a function with a single chunk setting, returning the time taken in each Dask stage along with chunk information
Profile results
- time_total
Total time taken to process the data (seconds)
- time_open
Time spent opening the dataset (seconds)
- time_function
Time spent running the function (seconds)
- time_optimize
Time spent optimizing the Dask graph (seconds)
- time_load
Time spent computing the data with Dask (seconds)
- chunks
Chunk shape
- nchunks_in
Number of chunks in loaded data
- nchunks_out
Number of chunks in function output
- chunksize_in
Size of chunks in loaded data
- chunksize_out
Size of chunks in function output
- tasks_in
Dask graph size in loaded data
- tasks_out
Dask graph size in function output
- tasks_optimized
Dask graph size after optimizing function output
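As a rough illustration of how these fields can be read together, the sketch below reuses the sample values from the profile() example on this page. It computes the share of total time spent in each stage and checks the reported input chunk size against the chunk shape. The 4-byte item size is an assumption (float32 data); it is not stated on this page, although it is consistent with the '4 B' scalar output in the example.

```python
import math

# Sample values copied from the profile() example output on this page.
result = {
    "time_total": 9.561158710159361,
    "time_open": 0.014718276914209127,
    "time_function": 0.0033595040440559387,
    "time_optimize": 0.01087462529540062,
    "time_load": 9.529402975924313,
    "chunks": {"time": 93, "latitude": 91, "longitude": 180},
}

# Fraction of the total wall time spent in each stage.
stages = ["time_open", "time_function", "time_optimize", "time_load"]
shares = {name: result[name] / result["time_total"] for name in stages}

# Bytes in one input chunk, ASSUMING 4-byte (float32) values.
ITEMSIZE = 4  # assumption, not taken from the library
chunk_bytes = math.prod(result["chunks"].values()) * ITEMSIZE
chunk_mb = chunk_bytes / 1e6  # decimal megabytes

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"chunk size: {chunk_mb:.2f} MB")
```

In this sample nearly all of the time is in time_load, which suggests the run is dominated by reading and computing the data rather than by graph construction or optimization.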
- climtas.profile.benchmark(paths: str, variable: str, chunks: Dict[str, List[int]], function, run_count: int = 3, mfdataset_args: Dict[str, Any] = {})
Profile a function on different chunks of data
Opens a dataset with xarray.open_mfdataset() using one of the chunk options, then runs function on variable.
>>> def func(da):
...     return da.mean()
>>> climtas.profile.benchmark(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time': [93, 93], 'latitude': [91, 91], 'longitude': [180, 180*2]})
- Parameters
paths – Paths to open (as xarray.open_mfdataset())
variable – Variable in the dataset to use
chunks – Mapping of dimension name to a list of chunk sizes, one entry for each run
function – Function to test the performance of; it takes an xarray.DataArray (the variable) and returns an xarray.DataArray
run_count – Number of times to run each profile (the minimum time is returned)
mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()
- Returns
pandas.DataFrame with information from profile() for each run
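The chunks argument holds one list per dimension, with "one entry for each run". A minimal sketch of how such a mapping could be turned into per-run chunk settings follows. The helper name expand_chunks is hypothetical, and pairing entries by position (rather than taking every combination) is an assumption based on that wording, not a confirmed description of how benchmark() works internally.

```python
from typing import Dict, List


def expand_chunks(chunks: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Hypothetical helper: build one {dimension: size} mapping per run.

    ASSUMPTION: the i-th entry of every list belongs to the i-th run.
    """
    lengths = {len(sizes) for sizes in chunks.values()}
    if len(lengths) != 1:
        raise ValueError("every dimension needs the same number of chunk sizes")
    n_runs = lengths.pop()
    return [{dim: sizes[i] for dim, sizes in chunks.items()} for i in range(n_runs)]


# The chunk options from the benchmark() example above.
runs = expand_chunks(
    {"time": [93, 93], "latitude": [91, 91], "longitude": [180, 180 * 2]}
)
for run in runs:
    print(run)
```

Under this reading, the example describes two runs that differ only in the longitude chunk size (180 and 360).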
- climtas.profile.profile(paths: str, variable: str, chunks: Dict[str, int], function, run_count: int = 3, mfdataset_args: Dict[str, Any] = {})
Run a function run_count times, returning the minimum time taken
>>> def func(da):
...     return da.mean()
>>> climtas.profile.profile(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time': 93, 'latitude': 91, 'longitude': 180})
{'time_total': 9.561158710159361,
 'time_open': 0.014718276914209127,
 'time_function': 0.0033595040440559387,
 'time_optimize': 0.01087462529540062,
 'time_load': 9.529402975924313,
 'chunks': {'time': 93, 'latitude': 91, 'longitude': 180},
 'nchunks_in': 512,
 'nchunks_out': 1,
 'chunksize_in': '6.09 MB',
 'chunksize_out': '4 B',
 'tasks_in': 513,
 'tasks_out': 1098,
 'tasks_optimized': 1098}
- Parameters
paths – Paths to open (as xarray.open_mfdataset())
variable – Variable in the dataset to use
chunks – Mapping of dimension name to chunk sizes
function – Function to test the performance of; it takes an xarray.DataArray (the variable) and returns a dask.array.Array
run_count – Number of times to run each profile (the minimum time is returned)
mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()
- Returns
Dict[str, Any] profiling information (the fields listed under Profile results)
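Repeating a measurement and keeping the minimum is a common way to reduce noise from caching and other activity on the system. The sketch below shows that pattern in isolation, using only the standard library. The names time_call and fastest_run are hypothetical, and selecting the whole run with the smallest time_total (rather than taking a minimum per field) is an assumption, not a description of how profile() is implemented.

```python
import time
from typing import Any, Callable, Dict


def time_call(func: Callable[[], Any]) -> Dict[str, float]:
    """Hypothetical stand-in for a single profiling run: time one call."""
    start = time.perf_counter()
    func()
    return {"time_total": time.perf_counter() - start}


def fastest_run(
    run_once: Callable[[], Dict[str, Any]], run_count: int = 3
) -> Dict[str, Any]:
    """Call run_once run_count times and keep the run with the lowest time_total.

    ASSUMPTION: the fastest run is returned whole, not a per-field minimum.
    """
    if run_count < 1:
        raise ValueError("run_count must be at least 1")
    results = [run_once() for _ in range(run_count)]
    return min(results, key=lambda r: r["time_total"])


# Time a small pure-Python workload three times and keep the fastest run.
best = fastest_run(lambda: time_call(lambda: sum(range(100_000))), run_count=3)
print(f"fastest of 3 runs: {best['time_total']:.6f} s")
```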
- climtas.profile.profile_once(paths: str, variable: str, chunks: Dict[str, int], function, mfdataset_args: Dict[str, Any] = {})
Run a single profile instance
>>> def func(da):
...     return da.mean()
>>> climtas.profile.profile_once(
...     '/g/data/ub4/era5/netcdf/surface/t2m/2019/t2m_era5_global_20190101_*.nc',
...     variable='t2m',
...     function=func,
...     chunks={'time': 93, 'latitude': 91, 'longitude': 180})
{'time_total': 9.561158710159361,
 'time_open': 0.014718276914209127,
 'time_function': 0.0033595040440559387,
 'time_optimize': 0.01087462529540062,
 'time_load': 9.529402975924313,
 'chunks': {'time': 93, 'latitude': 91, 'longitude': 180},
 'nchunks_in': 512,
 'nchunks_out': 1,
 'chunksize_in': '6.09 MB',
 'chunksize_out': '4 B',
 'tasks_in': 513,
 'tasks_out': 1098,
 'tasks_optimized': 1098}
- Parameters
paths – Paths to open (as xarray.open_mfdataset())
variable – Variable in the dataset to use
chunks – Mapping of dimension name to chunk sizes
function – Function to test the performance of; it takes an xarray.DataArray (the variable) and returns a dask.array.Array
mfdataset_args – Extra arguments to pass to xarray.open_mfdataset()
- Returns
Dict[str, Any] profiling information