climtas.event
Functions for locating and analysing 'events' within a dataset. Locate where events are with find_events(), then analyse them with map_events() to create a pandas.DataFrame.
- climtas.event.atleastn(da: xarray.core.dataarray.DataArray, n: int, dim: str = 'time') → xarray.core.dataarray.DataArray
Filter to return values with at least n contiguous points around them
>>> da = xarray.DataArray([0,1.4,0.8,1,-0.1,2.9,0.6], dims=['time'])
>>> atleastn(da.where(da > 0), 3)
<xarray.DataArray (time: 7)>
array([nan, 1.4, 0.8, 1. , nan, nan, nan])
Dimensions without coordinates: time
- Parameters
  - da (xarray.DataArray) – Pre-filtered event values
  - n (int) – Minimum event length
  - dim (str) – Dimension to work on
- Returns
  xarray.DataArray with events from da that are at least n long along dimension dim
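The run-length idea behind this filter can be sketched in plain NumPy. The helper below (atleastn_1d is a hypothetical name, not the climtas implementation) keeps only values sitting inside a run of at least n valid points:

```python
import numpy as np

def atleastn_1d(values, n):
    """Keep values only where they sit in a run of at least n valid points.

    Hypothetical 1-D sketch of the run-length logic, not the climtas code.
    """
    valid = ~np.isnan(values)
    out = np.full(values.shape, np.nan)
    start = None
    # Append a trailing False so a run touching the end of the array is closed
    for i, v in enumerate(np.append(valid, False)):
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= n:
                out[start:i] = values[start:i]
            start = None
    return out

da = np.array([0, 1.4, 0.8, 1, -0.1, 2.9, 0.6])
print(atleastn_1d(np.where(da > 0, da, np.nan), 3))
```

The run of length two at the end is dropped, reproducing the masked output shown in the doctest above.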
- climtas.event.event_coords(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Converts the index values returned by find_events() to coordinate values

>>> da = xarray.DataArray([0,1,1,1,0,1,1], coords=[('time', pandas.date_range('20010101', periods=7, freq='D'))])
>>> events = find_events(da > 0)
>>> event_coords(da, events)
        time event_duration
0 2001-01-02         3 days
1 2001-01-06            NaT
If ‘events’ has an ‘event_duration’ column this will be converted to a time duration. If the event goes to the end of the data the duration is marked as not a time, as the end date is unknown.
- Parameters
  - da (xarray.DataArray) – Source data values
  - events (pandas.DataFrame) – Event start & durations, e.g. from find_events() or extend_events()
- Returns
  pandas.DataFrame with the same columns as 'events', but with index values converted to coordinates
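The conversion can be sketched in plain pandas (a hypothetical reimplementation, not the climtas code): look up each start index in the time coordinate, and mark events that run to the end of the data as NaT since their end date is unknown:

```python
import pandas as pd

# Hypothetical sketch: times stands in for da's time coordinate, and events
# mirrors the find_events() output from the doctest above.
times = pd.date_range("20010101", periods=7, freq="D")
events = pd.DataFrame({"time": [1, 5], "event_duration": [3, 2]})

# An event running to the end of the data has an unknown end date -> NaT
complete = events["time"] + events["event_duration"] < len(times)
coords = pd.DataFrame({
    "time": times[events["time"]],
    "event_duration": pd.to_timedelta(events["event_duration"], unit="D").where(complete),
})
print(coords)
```

The second event covers the final timestep, so its duration becomes NaT, matching the doctest.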
- climtas.event.event_da(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, values: numpy.ndarray) → xarray.core.dataarray.DataArray
Create a xarray.DataArray with 'values' at event locations
- Parameters
  - da (xarray.DataArray) – Source data values
  - events (pandas.DataFrame) – Index values, e.g. from find_events() or extend_events()
  - values (numpy.ndarray-like) – Value to give to each location specified by event
- Returns
  xarray.DataArray with the same axes as da, with the value at each event location given by values
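The scatter operation can be sketched in one dimension with NumPy (a hypothetical illustration, not the climtas code): each event row paints its value over the indices that event covers.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-D sketch: one value per event, scattered over each event's span.
events = pd.DataFrame({"time": [1, 5], "event_duration": [3, 2]})
values = np.array([10.0, 20.0])  # one value per event row

out = np.full(7, np.nan)
for event_id, ev in events.iterrows():
    out[ev["time"] : ev["time"] + ev["event_duration"]] = values[event_id]
# out is now [nan, 10., 10., 10., nan, 20., 20.]
```

The real function does the equivalent across all of da's dimensions, returning a DataArray rather than a bare array.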
- climtas.event.event_values(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, use_dask=True) → dask.dataframe.core.DataFrame
Gets the values from da where an event is active
Note that for a large dataset with many events this can consume a considerable amount of memory. use_dask=True will return an uncomputed dask.dataframe.DataFrame instead of a pandas.DataFrame, potentially saving memory. You can compute the results later with dask.dataframe.DataFrame.compute().

>>> da = xarray.DataArray(
...     [0,3,6,2,0,8,6],
...     coords=[('time',
...         pandas.date_range('20010101', periods=7, freq='D'))])
>>> events = find_events(da > 0)
>>> event_values(da, events)
   time  event_id  value
0     1         0      3
1     2         0      6
2     3         0      2
3     5         1      8
4     6         1      6
See the Dask DataFrame documentation for performance information. In general basic reductions (min, mean, max, etc.) should be fast.
>>> values = event_values(da, events)
>>> values.groupby('event_id').value.mean()
event_id
0    3.666667
1    7.000000
Name: value, dtype: float64
For custom aggregations, setting an index may help performance. This sorts the data, though, so may use a lot of memory.
>>> values = values.set_index('event_id')
>>> values.groupby('event_id').value.apply(lambda x: x.min())
event_id
0    2
1    6
Name: value, dtype: int64
- Parameters
  - da (xarray.DataArray) – Source data values
  - events (pandas.DataFrame) – Event start & durations, e.g. from find_events()
  - use_dask (bool) – If true, returns an uncomputed Dask DataFrame. If false, computes the values before returning
- Returns
  dask.dataframe.DataFrame with columns event_id, time, value
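The long table this returns (pandas, or dask after .compute()) supports several reductions in one pass via groupby-agg, in line with the note above that basic reductions are fast. A small pandas sketch, using the same values as the doctest:

```python
import pandas as pd

# Mirrors the event_values() output from the doctest above.
values = pd.DataFrame({
    "time": [1, 2, 3, 5, 6],
    "event_id": [0, 0, 0, 1, 1],
    "value": [3, 6, 2, 8, 6],
})
# Several basic reductions per event in a single groupby pass
summary = values.groupby("event_id")["value"].agg(["min", "mean", "max"])
print(summary)
```

The same groupby works unchanged on the Dask DataFrame, where it is evaluated lazily until computed.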
- climtas.event.event_values_block(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, offset: Union[Iterable[int], Dict[Hashable, int]], load: bool = True)
Gets the values from da where an event is active
- climtas.event.extend_events(events: pandas.core.frame.DataFrame)
Extend the ‘events’ DataFrame to hold indices for the full event duration
find_events() returns only the start index of events. This will extend the DataFrame to cover the indices of the entire event. In addition to the indices, a column 'event_id' gives the matching index in 'events' for the row

>>> da = xarray.DataArray([0,1,1,1,0,1,1], coords=[('time', pandas.date_range('20010101', periods=7, freq='D'))])
>>> events = find_events(da > 0)
>>> extend_events(events)
   time  event_id
0     1         0
1     2         0
2     3         0
3     5         1
4     6         1
- Parameters
  - da (xarray.DataArray) – Source data values
  - events (pandas.DataFrame) – Event start & durations, e.g. from find_events()
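The expansion from (start, duration) rows to one row per index can be sketched in plain pandas (a hypothetical reimplementation, not the climtas code):

```python
import pandas as pd

# Mirrors the find_events() output from the doctest above.
events = pd.DataFrame({"time": [1, 5], "event_duration": [3, 2]})

# One row per index covered by each event, tagged with the event's row id
extended = pd.DataFrame([
    {"time": t, "event_id": event_id}
    for event_id, ev in events.iterrows()
    for t in range(ev["time"], ev["time"] + ev["event_duration"])
])
print(extended)
```

This reproduces the five-row table in the doctest: three indices for event 0 and two for event 1.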
- climtas.event.filter_block(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, offset: Dict[Hashable, int]) → pandas.core.frame.DataFrame
Filters events to within the current block horizontally
- climtas.event.find_events(da: xarray.core.dataarray.DataArray, min_duration: int = 1, use_dask: Optional[bool] = None, compute: bool = True) → pandas.core.frame.DataFrame
Find ‘events’ in a DataArray mask
Events are defined as being active when the array value is truthy. You should generally pass in the results of a comparison against some kind of threshold
>>> da = xarray.DataArray([0,1,1,1,0,1,1], dims=['time'])
>>> find_events(da > 0)
   time  event_duration
0     1               3
1     5               2
It’s assumed that events are reasonably sparse for large arrays.
If use_dask is True or use_dask is None and the source dataset is only chunked horizontally then events will be searched for in each Dask chunk and the results aggregated afterward.
- Parameters
  - da (xarray.DataArray) – Input mask, valid when an event is active. Must have a 'time' dimension; dtype is expected to be bool (or something else that is truthy when an event is active)
  - min_duration (int) – Minimum event duration to return
  - use_dask (bool) – Enable Dask parallelism (default True if da is chunked)
  - compute (bool) – Compute the Dask operations. Note that if False the dataframe index will not have unique values
- Returns
  A pandas.DataFrame containing event start points and durations. This will contain columns for each dimension in da, as well as an 'event_duration' column
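The event search in the 1-D case amounts to run-length encoding of a boolean mask. A NumPy sketch of that idea (find_events_1d is a hypothetical helper, not the climtas implementation):

```python
import numpy as np
import pandas as pd

def find_events_1d(mask, min_duration=1):
    """Run-length sketch of the 1-D case (hypothetical, not the climtas code)."""
    # Pad with False so runs touching either end still produce transitions
    padded = np.concatenate(([False], np.asarray(mask, dtype=bool), [False]))
    diff = np.diff(padded.astype(int))
    starts = np.where(diff == 1)[0]              # False -> True transitions
    durations = np.where(diff == -1)[0] - starts  # True -> False transitions
    keep = durations >= min_duration
    return pd.DataFrame({"time": starts[keep], "event_duration": durations[keep]})

da = np.array([0, 1, 1, 1, 0, 1, 1])
print(find_events_1d(da > 0))
```

For the mask in the doctest this finds the same two events, starting at indices 1 and 5 with durations 3 and 2; min_duration=3 would drop the second.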
- climtas.event.find_events_block(da: xarray.core.dataarray.DataArray, offset: Iterable[int], min_duration: int = 1, load: bool = True) → pandas.core.frame.DataFrame
Find ‘events’ in a section of a DataArray
See find_events(). Intended to run on a Dask block, with the results from each block merged.
If an event is active at the first or last timestep it is returned regardless of its duration, so it can be joined with neighbours.
- climtas.event.join_events(events: List[pandas.core.frame.DataFrame], offsets: Optional[List[List[int]]] = None, dims: Optional[Tuple[Hashable, ...]] = None) → pandas.core.frame.DataFrame
Join consecutive events in multiple dataframes
The returned events will be in an arbitrary order; the index may not match entries from the input dataframes.
>>> events = [
...     pandas.DataFrame([[1, 2]], columns=["time", "event_duration"]),
...     pandas.DataFrame([[3, 1]], columns=["time", "event_duration"]),
... ]
>>> join_events(events)
   time  event_duration
0     1               3
- Parameters
  - events – List of results from find_events()
- Returns
  pandas.DataFrame where results that end when the next event starts are joined together
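The join condition can be sketched in plain pandas (a hypothetical reimplementation, not the climtas code): after sorting by start, an event is merged into its predecessor when the predecessor ends exactly where it begins.

```python
import pandas as pd

# Same two per-block results as the doctest above.
blocks = [
    pd.DataFrame([[1, 2]], columns=["time", "event_duration"]),
    pd.DataFrame([[3, 1]], columns=["time", "event_duration"]),
]
merged = []
for ev in pd.concat(blocks).sort_values("time").itertuples(index=False):
    if merged and merged[-1]["time"] + merged[-1]["event_duration"] == ev.time:
        # This event starts exactly where the previous one ended: join them
        merged[-1]["event_duration"] += ev.event_duration
    else:
        merged.append({"time": ev.time, "event_duration": ev.event_duration})
joined = pd.DataFrame(merged)
print(joined)
```

The event starting at time 3 begins where the first (start 1, duration 2) ends, so the two collapse into one event of duration 3, as in the doctest.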
- climtas.event.map_events(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, func, *args, **kwargs) → pandas.core.frame.DataFrame
Map a function against multiple events
The output is the value from func evaluated at each of the events. Events should at a minimum have columns for each coordinate in da as well as an 'event_duration' column that records how long each event is, as is returned by find_events():

>>> da = xarray.DataArray([0,1,1,1,0,1,1], dims=['time'])
>>> events = find_events(da > 0)
>>> map_events(da, events, lambda x: x.sum().item())
0    3
1    2
dtype: int64
You may wish to filter the events DataFrame first to combine close events or to remove very short events.
If func returns a dict results will be converted into columns. This will be more efficient than running map_events once for each operation:
>>> da = xarray.DataArray([0,1,1,1,0,1,1], dims=['time'])
>>> events = find_events(da > 0)
>>> map_events(da, events, lambda x: {'mean': x.mean().item(), 'std': x.std().item()})
   mean  std
0   1.0  0.0
1   1.0  0.0
pandas.DataFrame.join() can be used to link up the results with their corresponding coordinates:

>>> da = xarray.DataArray([0,1,1,1,0,1,1], dims=['time'])
>>> events = find_events(da > 0)
>>> sums = map_events(da, events, lambda x: {'sum': x.sum().item()})
>>> events.join(sums)
   time  event_duration  sum
0     1               3    3
1     5               2    2
- Parameters
  - da (xarray.DataArray) – Source data values
  - events (pandas.DataFrame) – Event start & durations, e.g. from find_events()
  - func ((xarray.DataArray, *args, **kwargs) -> Dict[str, Any]) – Function to apply to each event
  - *args – Passed to func
  - **kwargs – Passed to func
- Returns
  pandas.DataFrame with each row the result of applying func to the corresponding event row. Behaves like pandas.DataFrame.apply() with result_type='expand'
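The result_type='expand' behaviour can be demonstrated in plain pandas (a hypothetical sketch of the mapping, not the climtas code): a per-event function returning a dict produces one column per key.

```python
import pandas as pd

# data stands in for the 1-D DataArray; events mirrors find_events(da > 0).
data = pd.Series([0, 1, 1, 1, 0, 1, 1])
events = pd.DataFrame({"time": [1, 5], "event_duration": [3, 2]})

def stats(ev):
    # Slice the event's values out of the series, then reduce them
    vals = data.iloc[ev["time"] : ev["time"] + ev["event_duration"]]
    return {"sum": vals.sum(), "mean": vals.mean()}

results = events.apply(stats, axis=1, result_type="expand")
print(results)
```

As in the doctests, the dict keys become 'sum' and 'mean' columns, one row per event.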