{ "cells": [ { "cell_type": "markdown", "id": "5ac8f107-17a2-4a76-9867-7d4f3eb6e17e", "metadata": { "tags": [] }, "source": [ "Groupby\n", "=======" ] }, { "cell_type": "code", "execution_count": 1, "id": "0635b77c-ac32-4018-8ff5-4cdd48c9f3e3", "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "import xarray\n", "import climtas\n", "import dask.array\n", "import pandas\n", "import numpy" ] }, { "cell_type": "markdown", "id": "d40d0ab9-acc9-404b-9527-7f508b3a2e02", "metadata": {}, "source": [ "Say we have daily input data covering several years that we want to convert to a daily mean climatology, i.e. the mean over all years for each day of the year." ] }, { "cell_type": "code", "execution_count": 2, "id": "420b86c6-d774-4c16-a865-db614f4849ab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
<xarray.DataArray 'temperature' (time: 1095, lat: 50, lon: 100)>\n",
       "dask.array<random_sample, shape=(1095, 50, 100), dtype=float64, chunksize=(90, 25, 25), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * time     (time) datetime64[ns] 2001-01-01 2001-01-02 ... 2003-12-31\n",
       "  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0\n",
       "  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * time (time) datetime64[ns] 2001-01-01 2001-01-02 ... 2003-12-31\n", " * lat (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0\n", " * lon (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "time = pandas.date_range('20010101', '20040101', freq='D', closed='left')\n", "\n", "data = dask.array.random.random((len(time),50,100), chunks=(90,25,25))\n", "lat = numpy.linspace(-90, 90, data.shape[1])\n", "lon = numpy.linspace(-180, 180, data.shape[2], endpoint=False)\n", "\n", "da = xarray.DataArray(data, coords=[('time', time), ('lat', lat), ('lon', lon)], name='temperature')\n", "da" ] }, { "cell_type": "markdown", "id": "337b6e09-6013-4e48-bf7d-368f9b660fd9", "metadata": {}, "source": [ "The Xarray way is to use [xarray.DataArray.groupby](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.groupby.html), however that is an expensive function to run - we started with 104 tasks and 104 chunks in the Dask graph, and this has exploded to 23,464 tasks and 2920 chunks. For a large dataset this increase in chunk counts really bogs down Dask.\n", "\n", "The reason for this is that with `groupby` Xarray will create a new output chunk for each individual day - you can see the chunk size of the output is now `(1, 25, 25)`." ] }, { "cell_type": "code", "execution_count": 3, "id": "70ae75b4-08b5-4b2e-be40-d40d2765eb77", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
<xarray.DataArray 'temperature' (dayofyear: 365, lat: 50, lon: 100)>\n",
       "dask.array<stack, shape=(365, 50, 100), dtype=float64, chunksize=(1, 25, 25), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat        (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0\n",
       "  * lon        (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4\n",
       "  * dayofyear  (dayofyear) int64 1 2 3 4 5 6 7 8 ... 359 360 361 362 363 364 365
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * lat (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0\n", " * lon (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4\n", " * dayofyear (dayofyear) int64 1 2 3 4 5 6 7 8 ... 359 360 361 362 363 364 365" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "da.groupby('time.dayofyear').mean()" ] }, { "cell_type": "markdown", "id": "be339f60-be3d-4920-a9a1-d207f19eb0ad", "metadata": {}, "source": [ "[climtas.blocked.blocked_groupby](api/blocked.rst#climtas.blocked.blocked_groupby) will as much as possible limit the number of chunks created/ It does this by reshaping the array, stacking individual years, then reducing over the new stacked axis rather than using Pandas indexing operations. It does however require the input data to be evenly spaced in time, which well-behaved datasets should be." ] }, { "cell_type": "code", "execution_count": 4, "id": "6a363b0a-8097-4277-b537-22c26649e4ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
<xarray.DataArray 'stack-f4e41af3171d33521253e01e4a44f4a5' (dayofyear: 366, lat: 50, lon: 100)>\n",
       "dask.array<mean_agg-aggregate, shape=(366, 50, 100), dtype=float64, chunksize=(80, 25, 25), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat        (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0\n",
       "  * lon        (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4\n",
       "  * dayofyear  (dayofyear) int64 1 2 3 4 5 6 7 8 ... 360 361 362 363 364 365 366
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * lat (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0\n", " * lon (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4\n", " * dayofyear (dayofyear) int64 1 2 3 4 5 6 7 8 ... 360 361 362 363 364 365 366" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "climtas.blocked_groupby(da, time='dayofyear').mean()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 5 }