Functions for locating and analysing ‘events’ within a dataset

Locate where events are with find_events(), then analyse them with map_events() to create a pandas.DataFrame.

climtas.event.atleastn(da: xarray.core.dataarray.DataArray, n: int, dim: str = 'time') → xarray.core.dataarray.DataArray

Filter to return values with at least n contiguous points around them

>>> da = xarray.DataArray([0,1.4,0.8,1,-0.1,2.9,0.6], dims=['time'])
>>> atleastn(da.where(da > 0), 3)
<xarray.DataArray (time: 7)>
array([nan, 1.4, 0.8, 1. , nan, nan, nan])
Dimensions without coordinates: time
Parameters:

  • da (xarray.DataArray) – Pre-filtered event values

  • n (int) – Minimum event length

  • dim (str) – Dimension to work on


Returns: xarray.DataArray with events from da that are at least n points long along dimension dim
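The run-length filtering that atleastn performs can be sketched as a 1-D NumPy analogue (a simplified illustration of the idea, not climtas's implementation):

```python
import numpy as np

def atleastn_1d(values: np.ndarray, n: int) -> np.ndarray:
    """Keep values that sit inside a run of at least n contiguous
    non-NaN points; everything else becomes NaN (1-D sketch)."""
    valid = ~np.isnan(values)
    # +1 marks where a run of valid points starts, -1 one past its end
    edges = np.diff(np.concatenate(([0], valid.astype(np.int8), [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    out = np.full(values.shape, np.nan)
    for start, end in zip(starts, ends):
        if end - start >= n:
            out[start:end] = values[start:end]
    return out

da = np.array([0, 1.4, 0.8, 1, -0.1, 2.9, 0.6], dtype=float)
masked = np.where(da > 0, da, np.nan)  # analogue of da.where(da > 0)
print(atleastn_1d(masked, 3))
# [nan 1.4 0.8 1.  nan nan nan]
```

This reproduces the doctest above: only the three-point run survives, while the two-point run at the end is masked out.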

climtas.event.event_coords(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Converts the index values returned by find_events() to coordinate values

>>> da = xarray.DataArray([0,1,1,1,0,1,1], coords=[('time', pandas.date_range('20010101', periods=7, freq='D'))])
>>> events = find_events(da > 0)
>>> event_coords(da, events)
        time event_duration
0 2001-01-02         3 days
1 2001-01-06            NaT

If ‘events’ has an ‘event_duration’ column, it will be converted to a time duration. If an event extends to the end of the data, its duration is marked as NaT (not a time), as the end date is unknown.


Returns: pandas.DataFrame with the same columns as ‘events’, but with index values converted to coordinates
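The index-to-coordinate conversion can be sketched in plain pandas, using hypothetical event data shaped like the find_events() output above (a simplified illustration, not climtas's implementation):

```python
import pandas as pd

times = pd.date_range("20010101", periods=7, freq="D")
# As returned by find_events(): start index and length of each event
events = pd.DataFrame({"time": [1, 5], "event_duration": [3, 2]})

step = times[1] - times[0]
coords = pd.DataFrame({
    # Look up the coordinate value at each start index
    "time": times[events["time"].to_numpy()],
    # An event reaching the last index has an unknown end date, hence NaT
    "event_duration": [
        dur * step if start + dur < len(times) else pd.NaT
        for start, dur in zip(events["time"], events["event_duration"])
    ],
})
print(coords)
```

The first event becomes a start date of 2001-01-02 with a 3-day duration; the second runs to the end of the data, so its duration is NaT.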

climtas.event.event_da(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, values: numpy.ndarray) → xarray.core.dataarray.DataArray

Create a xarray.DataArray with ‘values’ at event locations


Returns: xarray.DataArray with the same axes as da, with the values at each event location given by ‘values’
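A 1-D NumPy sketch of the idea, assuming (hypothetically) one value per event painted across that event's duration; check the climtas source for the exact alignment of the values argument:

```python
import numpy as np

length = 7
event_starts = np.array([1, 5])
event_durations = np.array([3, 2])
per_event_values = np.array([10.0, 20.0])  # hypothetical: one value per event

# Everywhere outside an event stays NaN
out = np.full(length, np.nan)
for start, dur, value in zip(event_starts, event_durations, per_event_values):
    out[start:start + dur] = value
print(out)
# [nan 10. 10. 10. nan 20. 20.]
```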

climtas.event.event_values(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, use_dask=True) → dask.dataframe.core.DataFrame

Gets the values from da where an event is active

Note that for a large dataset with many events this can consume a considerable amount of memory. use_dask=True will return an uncomputed dask.dataframe.DataFrame instead of a pandas.DataFrame, potentially saving memory. You can compute the results later with dask.dataframe.DataFrame.compute().

>>> da = xarray.DataArray(
...     [0,3,6,2,0,8,6],
...     coords=[('time',
...              pandas.date_range('20010101', periods=7, freq='D'))])
>>> events = find_events(da > 0)
>>> event_values(da, events)
   time  event_id  value
0     1         0      3
1     2         0      6
2     3         0      2
3     5         1      8
4     6         1      6

See the Dask DataFrame documentation for performance information. In general basic reductions (min, mean, max, etc.) should be fast.

>>> values = event_values(da, events)
>>> values.groupby('event_id').value.mean()
0    3.666667
1    7.000000
Name: value, dtype: float64

For custom aggregations, setting an index may help performance. Note that this sorts the data, so it may use a lot of memory:

>>> values = values.set_index('event_id')
>>> values.groupby('event_id').value.apply(lambda x: x.min())
0    2
1    6
Name: value, dtype: int64

Returns: dask.dataframe.DataFrame with columns event_id, time, value

climtas.event.event_values_block(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, offset: Union[Iterable[int], Dict[Hashable, int]], load: bool = True)

Gets the values from da where an event is active

climtas.event.extend_events(events: pandas.core.frame.DataFrame)

Extend the ‘events’ DataFrame to hold indices for the full event duration

find_events() returns only the start index of each event. This extends the DataFrame to cover the indices of the entire event. In addition to the indices, an ‘event_id’ column gives the matching row index in ‘events’

>>> da = xarray.DataArray([0,1,1,1,0,1,1], coords=[('time', pandas.date_range('20010101', periods=7, freq='D'))])
>>> events = find_events(da > 0)
>>> extend_events(events)
   time  event_id
0     1         0
1     2         0
2     3         0
3     5         1
4     6         1
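The expansion can be sketched with plain pandas and NumPy using np.repeat (a simplified 1-D analogue, not climtas's implementation):

```python
import numpy as np
import pandas as pd

# As returned by find_events(): start index and length of each event
events = pd.DataFrame({"time": [1, 5], "event_duration": [3, 2]})

durations = events["event_duration"].to_numpy()
extended = pd.DataFrame({
    # Repeat each start index for its duration, then add 0..duration-1
    "time": np.repeat(events["time"].to_numpy(), durations)
            + np.concatenate([np.arange(d) for d in durations]),
    # Row number in 'events' that each index belongs to
    "event_id": np.repeat(events.index.to_numpy(), durations),
})
print(extended)
#    time  event_id
# 0     1         0
# 1     2         0
# 2     3         0
# 3     5         1
# 4     6         1
```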
climtas.event.filter_block(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, offset: Dict[Hashable, int]) → pandas.core.frame.DataFrame

Filters events to those within the current block's horizontal extent

climtas.event.find_events(da: xarray.core.dataarray.DataArray, min_duration: int = 1, use_dask: Optional[bool] = None, compute: bool = True) → pandas.core.frame.DataFrame

Find ‘events’ in a DataArray mask

Events are defined as being active when the array value is truthy. You should generally pass in the result of a comparison against some kind of threshold:

>>> da = xarray.DataArray([0,1,1,1,0,1,1], dims=['time'])
>>> find_events(da > 0)
   time  event_duration
0     1               3
1     5               2

It’s assumed that events are reasonably sparse for large arrays.

If use_dask is True, or use_dask is None and the source dataset is only chunked horizontally, then events will be searched for in each Dask chunk and the results aggregated afterwards.

Parameters:

  • da (xarray.DataArray) – Input mask, valid when an event is active. Must have a ‘time’ dimension; dtype is expected to be bool (or something else that is truthy when an event is active)

  • min_duration (int) – Minimum event duration to return

  • use_dask (bool) – Enable Dask parallelism (default True if da is chunked)

  • compute (bool) – Compute the dask operations. Note that if False the dataframe index will not have unique values


Returns: pandas.DataFrame containing event start points and durations. This will contain a column for each dimension in da, as well as an ‘event_duration’ column
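The underlying run detection can be sketched for the 1-D case with NumPy edge differences (a simplified illustration of the idea, not climtas's chunk-aware implementation):

```python
import numpy as np
import pandas as pd

def find_events_1d(mask, min_duration=1):
    """Find runs of truthy values in a 1-D mask, returning each run's
    start index and length."""
    m = np.asarray(mask).astype(np.int8)
    # +1 marks a run start, -1 marks one past a run's end
    edges = np.diff(np.concatenate(([0], m, [0])))
    starts = np.flatnonzero(edges == 1)
    durations = np.flatnonzero(edges == -1) - starts
    keep = durations >= min_duration
    return pd.DataFrame({"time": starts[keep],
                         "event_duration": durations[keep]})

print(find_events_1d(np.array([0, 1, 1, 1, 0, 1, 1]) > 0))
#    time  event_duration
# 0     1               3
# 1     5               2
```

This matches the doctest above; the real function extends the same idea across extra dimensions and Dask chunks.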

climtas.event.find_events_block(da: xarray.core.dataarray.DataArray, offset: Iterable[int], min_duration: int = 1, load: bool = True) → pandas.core.frame.DataFrame

Find ‘events’ in a section of a DataArray

See find_events()

Intended to run on a Dask block, with the results from each block merged.

If an event is active at the first or last timestep it is returned regardless of its duration, so it can be joined with neighbours

climtas.event.join_events(events: List[pandas.core.frame.DataFrame], offsets: Optional[List[List[int]]] = None, dims: Optional[Tuple[Hashable, ...]] = None) → pandas.core.frame.DataFrame

Join consecutive events in multiple dataframes

The returned events will be in an arbitrary order; the index may not match entries from the input dataframes.

>>> events = [
...    pandas.DataFrame([[1, 2]], columns=["time", "event_duration"]),
...    pandas.DataFrame([[3, 1]], columns=["time", "event_duration"]),
... ]
>>> join_events(events)
   time  event_duration
0     1               3

Parameters:

  • events (List[pandas.DataFrame]) – List of results from find_events()


Returns: pandas.DataFrame where events that end when the next event starts are joined together
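The joining rule can be sketched in plain pandas (simplified, time-only; climtas also handles offsets and extra dimensions):

```python
import pandas as pd

def join_runs(events: pd.DataFrame) -> pd.DataFrame:
    """Merge events that abut: an event starting exactly where the
    previous one ends is folded into it."""
    merged = []
    for _, row in events.sort_values("time").iterrows():
        if merged and merged[-1]["time"] + merged[-1]["event_duration"] == row["time"]:
            # Consecutive run: extend the previous event
            merged[-1]["event_duration"] += row["event_duration"]
        else:
            merged.append({"time": row["time"],
                           "event_duration": row["event_duration"]})
    return pd.DataFrame(merged)

parts = pd.concat([
    pd.DataFrame([[1, 2]], columns=["time", "event_duration"]),
    pd.DataFrame([[3, 1]], columns=["time", "event_duration"]),
])
print(join_runs(parts))
#    time  event_duration
# 0     1               3
```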

climtas.event.map_events(da: xarray.core.dataarray.DataArray, events: pandas.core.frame.DataFrame, func, *args, **kwargs) → pandas.core.frame.DataFrame

Map a function against multiple events

The output is the value from func evaluated at each of the events. Events should at a minimum have columns for each coordinate in da as well as an ‘event_duration’ column that records how long each event is, as is returned by find_events():

>>> da = xarray.DataArray([0,1,1,1,0,1,1], dims=['time'])
>>> events = find_events(da > 0)
>>> map_events(da, events, lambda x: x.sum().item())
0    3
1    2
dtype: int64

You may wish to filter the events DataFrame first to combine close events or to remove very short events.

If func returns a dict results will be converted into columns. This will be more efficient than running map_events once for each operation:

>>> da = xarray.DataArray([0,1,1,1,0,1,1], dims=['time'])
>>> events = find_events(da > 0)
>>> map_events(da, events, lambda x: {'mean': x.mean().item(), 'std': x.std().item()})
   mean  std
0   1.0  0.0
1   1.0  0.0

pandas.DataFrame.join() can be used to link up the results with their corresponding coordinates:

>>> da = xarray.DataArray([0,1,1,1,0,1,1], dims=['time'])
>>> events = find_events(da > 0)
>>> sums = map_events(da, events, lambda x: {'sum': x.sum().item()})
>>> events.join(sums)
   time  event_duration  sum
0     1               3    3
1     5               2    2

Returns: pandas.DataFrame with each row the result of applying func to the corresponding event row. Behaves like pandas.DataFrame.apply() with result_type='expand'