{ "cells": [ { "cell_type": "markdown", "id": "5ac8f107-17a2-4a76-9867-7d4f3eb6e17e", "metadata": { "tags": [] }, "source": [ "Percentile\n", "==========" ] }, { "cell_type": "code", "execution_count": 1, "id": "0635b77c-ac32-4018-8ff5-4cdd48c9f3e3", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "import xarray\n", "import climtas\n", "import dask.array\n", "import pandas\n", "import numpy" ] }, { "cell_type": "markdown", "id": "d40d0ab9-acc9-404b-9527-7f508b3a2e02", "metadata": {}, "source": [ "We have a year of hourly input data, and we want to calculate the 90th percentile along the time axis at each grid point" ] }, { "cell_type": "code", "execution_count": 2, "id": "420b86c6-d774-4c16-a865-db614f4849ab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<xarray.DataArray 'temperature' (time: 8760, lat: 50, lon: 100)>\n", "dask.array<random_sample, shape=(8760, 50, 100), dtype=float64, chunksize=(1440, 25, 25), chunktype=numpy.ndarray>\n", "Coordinates:\n", "  * time     (time) datetime64[ns] 2001-01-01 ... 2001-12-31T23:00:00\n", "  * lat      (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 78.98 82.65 86.33 90.0\n", "  * lon      (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "time = pandas.date_range('20010101', '20020101', freq='H', closed='left')\n", "\n", "data = dask.array.random.random((len(time),50,100), chunks=(24*60,25,25))\n", "lat = numpy.linspace(-90, 90, data.shape[1])\n", "lon = numpy.linspace(-180, 180, data.shape[2], endpoint=False)\n", "\n", "da = xarray.DataArray(data, coords=[('time', time), ('lat', lat), ('lon', lon)], name='temperature')\n", "da" ] }, { "cell_type": "markdown", "id": "337b6e09-6013-4e48-bf7d-368f9b660fd9", "metadata": {}, "source": [ "The Xarray way is to use [xarray.DataArray.quantile](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.quantile.html); however, for a dataset chunked along the time axis this raises an error:" ] }, { "cell_type": "code", "execution_count": 3, "id": "70ae75b4-08b5-4b2e-be40-d40d2765eb77", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. 
To fix, either rechunk into a single dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.\n" ] } ], "source": [ "try:\n", "    da.quantile(0.9, 'time')\n", "except Exception as e:\n", "    print('Error:', e)" ] }, { "cell_type": "markdown", "id": "c22526f7-044e-4643-94e8-9dfc400aa453", "metadata": {}, "source": [ "To get this to work we must rechunk the data so that it is not chunked along the time axis. This is a very expensive operation for large datasets, where variables are commonly split into multiple files covering different years." ] }, { "cell_type": "code", "execution_count": 4, "id": "85ef8a8b-aed5-4e9b-944f-b78e9df78d6f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<xarray.DataArray 'temperature' (lat: 50, lon: 100)>\n", "dask.array<getitem, shape=(50, 100), dtype=float64, chunksize=(25, 25), chunktype=numpy.ndarray>\n", "Coordinates:\n", "  * lat       (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0\n", "  * lon       (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4\n", "    quantile  float64 0.9" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "da.chunk({'time': None}).quantile(0.9, 'time')" ] }, { "cell_type": "markdown", "id": "9df35f0b-4e1a-4d2e-bea3-495aa72fd7a3", "metadata": {}, "source": [ "Dask has an approximate percentile operation that works on data chunked along time; however, it only runs on one-dimensional data:" ] }, { "cell_type": "code", "execution_count": 5, "id": "9a522fcf-fe18-4fe5-973d-035cd0d8f59f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "dask.array<percentile, shape=(1,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dask.array.percentile(da.data[:, 30, 60], 90)" ] }, { "cell_type": "markdown", "id": "be339f60-be3d-4920-a9a1-d207f19eb0ad", "metadata": {}, "source": [ "[climtas.blocked.approx_percentile](api/blocked.rst#climtas.blocked.approx_percentile) extends the Dask parallel approximate percentile calculation to multi-dimensional datasets, so large datasets don't need to be rechunked.\n", "\n", "Note that this is an approximation to the true percentile; check that it behaves appropriately for your dataset by comparing `approx_percentile` and `numpy.percentile` on a subset of the data." ] }, { "cell_type": "code", "execution_count": 6, "id": "6a363b0a-8097-4277-b537-22c26649e4ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.DataArray 'temperature' (percentile: 1, lat: 50, lon: 100)>\n", "dask.array<_merge_approx_percentile, shape=(1, 50, 100), dtype=float64, chunksize=(1, 25, 25), chunktype=numpy.ndarray>\n", "Coordinates:\n", "  * lat         (lat) float64 -90.0 -86.33 -82.65 -78.98 ... 82.65 86.33 90.0\n", "  * lon         (lon) float64 -180.0 -176.4 -172.8 -169.2 ... 169.2 172.8 176.4\n", "  * percentile  (percentile) int64 90" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "climtas.approx_percentile(da, 90, 'time')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 5 }