03 - Extracting forcing data

Extracting meteorological data for a selected watershed

Using a GeoJSON file extracted from the HydroSHEDS database or given by the user, meteorological datasets can be extracted inside the watershed’s boundaries using the PAVICS-Hydro ERA5 database.

import datetime as dt
import tempfile
from pathlib import Path

import fsspec  # noqa
import intake
import numpy as np
import s3fs  # noqa
import xarray as xr
from clisops.core import subset

from ravenpy.utilities.testdata import get_file

If we want to extract data for our watershed, we need to know:

  • The spatial extent (as defined by the watershed boundaries);

  • The temporal extent (as defined by the start and end days of the period of interest).

Let’s define those now:

# This will be our input section, where we control what we want to extract.
# We know which watershed interests us, it is the input.geojson file that we previously generated!

# The contour can be generated using notebook "01_Delineating watersheds, where it would be placed
# in the same folder as the notebooks and available in your workspace. The contour could then be accessed
# easily by defining it as follows:
"""
basin_contour = "input.geojson"
"""
# However, to keep things tidy, we have also prepared a version that can be accessed easily for
# demonstration purposes:
basin_contour = get_file("notebook_inputs/input.geojson")

# Also, we can specify which timeframe we want to extract. Here let's focus on a 10-year period
reference_start_day = dt.datetime(1985, 12, 31)
reference_stop_day = dt.datetime(1987, 1, 1)
# Notice we are using one day before and one day after the desired period of 1986-01-01 to 1986-12-31.
# This is to account for any UTC shifts that might require getting data in a previous or later time.

We now provide a means to get some data to run our model. Typically, models will require precipitation and temperature data, so let’s get that data. We will use a generally reliable dataset that is available everywhere to minimize missing values: the ERA5 Reanalysis.

The code block below gathers the required data automatically. If you need other data or want to use another source, this cell will need to be replaced for your customized needs.

# Get the ERA5 data from the Wasabi/Amazon S3 server.
catalog_name = "https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/atmosphere.yaml"
cat = intake.open_catalog(catalog_name)
ds = cat.era5_reanalysis_single_levels.to_dask()

Get the ERA5 data. We will rechunk it to a single chunk to make it compatible with other codes on the platform, especially bias-correction. We are also taking the daily min and max temperatures as well as the daily total precipitation.

# We will add a wrapper to ensure that the following operations will preserve the original data attributes, such as units and variable names.
with xr.set_options(keep_attrs=True):
    ERA5_reference = subset.subset_shape(
        ds.sel(time=slice(reference_start_day, reference_stop_day)), basin_contour
    )
    ERA5_tas = ERA5_reference["t2m"].resample(time="1D")
    ERA5_tmin = ERA5_tas.min().chunk(-1, -1, -1)
    ERA5_tmax = ERA5_tas.max().chunk(-1, -1, -1)
    ERA5_pr = ERA5_reference["tp"].resample(time="1D").sum().chunk(-1, -1, -1)

ERA5_pr

We can now convert these variables to the desired format and save them to disk in netcdf files to use at a later time (in a future notebook!)

First, we will want to make sure that the units we are working with are compatible with the Raven modelling framework. We will want precipitation to be in mm (per time period, here we are working daily so it will be in mm/day), and temperatures will be in °C. Let’s check out the current units:

print(f"Tmin units: {ERA5_tmin.units}")
print(f"Tmax units: {ERA5_tmax.units}")
print(f"Precipitation units: {ERA5_pr.units}")

We can see that the units are in Kelvin for temperatures and in meters for precipitation. We will want to do some conversions!

Let’s start by applying offsets for temperatures and a conversion factor for precipitation:

with xr.set_options(keep_attrs=True):
    ERA5_tmin = ERA5_tmin - 273.15  # K to °C
    ERA5_tmin.attrs["units"] = "degC"

    ERA5_tmax = ERA5_tmax - 273.15  # K to °C
    ERA5_tmax.attrs["units"] = "degC"

    ERA5_pr = ERA5_pr * 1000  # m to mm
    ERA5_pr.attrs["units"] = "mm"

We can see the changes now by re-inspecting the datasets:

print(f"Tmin units: {ERA5_tmin.units}")
print(f"Tmax units: {ERA5_tmax.units}")
print(f"Precipitation units: {ERA5_pr.units}")

So let’s write them to disk for now. We will use the netcdf format as this is what Raven uses for inputs. It is possible you will get some warnings, this is OK and should not cause any problems. Since our model will run in lumped mode, we will average the spatial dimensions of each variable over the domain.

with xr.set_options(keep_attrs=True):
    # Average the variables
    ERA5_tmin = ERA5_tmin.mean({"latitude", "longitude"})
    ERA5_tmax = ERA5_tmax.mean({"latitude", "longitude"})
    ERA5_pr = ERA5_pr.mean({"latitude", "longitude"})

    # Ensure that the precipitation is non-negative, which can happen with some reanalysis models.
    ERA5_pr = np.maximum(ERA5_pr, 0)

    # Transform them to a dataset such that they can be written with attributes to netcdf
    ERA5_tmin = ERA5_tmin.to_dataset(name="tmin", promote_attrs=True)
    ERA5_tmax = ERA5_tmax.to_dataset(name="tmax", promote_attrs=True)
    ERA5_pr = ERA5_pr.to_dataset(name="pr", promote_attrs=True)
# Check and see if the precipitation makes sense:
ERA5_pr.pr.plot()

Here we will write the files to disk in a temporary folder since the root folder containing these notebooks is read-only.

You can change the path here to your own preferred path in your writable workspace. Alternatively, if you copy this notebook to your writable-workspace as shown in the introduction documentation, you can save just the filename (no absolute path) and the file will appear “beside” the notebooks, ready to be read by the next series of notebooks.

For this case, you will want to use the second provided option in which all variables are stored in the same netcdf file. This will make the data much easier to find for Raven and prevent some errors, such as if Raven needs to calculate daily average temperature. This will be possible with all variables in the same file, but will cause an error if max and min temperature are in two seperate files.

# Option 1, which is not recommended to use in other notebooks but can be really useful in various other workflows:
with xr.set_options(keep_attrs=True):
    # Write to disk.
    tmp = Path(tempfile.mkdtemp())
    ERA5_tmin.to_netcdf(tmp / "ERA5_tmin.nc")
    ERA5_tmax.to_netcdf(tmp / "ERA5_tmax.nc")
    ERA5_pr.to_netcdf(tmp / "ERA5_pr.nc")
# Option 2, which is recommended, in which we prepare a single file that merges all three variables into one netcdf file:
with xr.set_options(keep_attrs=True):
    xr.merge([ERA5_tmin, ERA5_tmax, ERA5_pr]).to_netcdf(tmp / "ERA5_weather_data.nc")

We now have daily precipitation and minimum/maximum temperatures to drive our Raven Model, which we will do in the next notebook!

Note that our dataset generated here is very short (1 year) but the same dataset for the period 1980-12-31 to 1991-01-01 has been pre-generated and stored on the server for efficiency.