Data Loaders

Base

class weatherbenchX.data_loaders.base.DataLoader(interpolation: Interpolation | None = None, compute: bool = True, add_nan_mask: bool = False, process_chunk_fn: Callable[[Mapping[Hashable, DataArray]], Mapping[Hashable, DataArray]] | None = None, add_values_to_coords: bool = False)

Base class for data loaders.

Data loaders return chunks of data compatible with the rest of the evaluation framework. Specifically, a chunk should be an xr.Dataset or a dictionary of xr.DataArray objects. It is the data loaders’ job to return target and prediction chunks that can be broadcast against each other. If interpolation is required to map one dataset onto another, e.g. interpolating a gridded dataset to sparse points, a reference dataset can be provided for this purpose.

Shared initialization for data loaders.

Parameters:

interpolation – (Optional) Interpolation to be applied to the data.
compute – Load chunk into memory. Default: True.
add_nan_mask – Adds a boolean coordinate named ‘mask’ to each variable (variables will be split into DataArrays if they aren’t already), with False indicating NaN values. To be used for masked aggregation. Default: False.
process_chunk_fn – (Optional) Function to be applied to each chunk after loading but before interpolation, computing, and adding the NaN mask.
add_values_to_coords – If True, add returned values to coordinates. These will propagate into the statistics, and can therefore be used for binning. Default: False.

Xarray Data Loaders

class weatherbenchX.data_loaders.xarray_loaders.XarrayDataLoader(path: str | None = None, ds: Dataset | None = None, variables: Iterable[str] | None = None, sel_kwargs: Mapping[str, Any] | None = None, rename_dimensions: Mapping[str, str] | str | None = 'ecmwf', automatically_convert_lat_lon_to_latitude_longitude: bool = True, rename_variables: Mapping[str, str] | None = None, preprocessing_fn: Callable[[Dataset], Dataset] | None = None, **kwargs)

Base class for Xarray data loaders.

Init.

Parameters:

path – (Optional) Path to xarray dataset to open. If it ends with ‘.zarr’, it is opened using xr.open_zarr. Otherwise, it is opened using xr.open_dataset.
ds – (Optional) Already opened xarray dataset. Either path or ds must be specified.
variables – (Optional) List of variables to load (after renaming). Default: Load all variables.
sel_kwargs – (Optional) Keyword arguments to pass to .sel() after renaming.
rename_dimensions – (Optional) Dictionary of dimensions to rename, or the shortcut ‘ecmwf’. The data loaders expect the following time dimensions: init_time and lead_time for a forecast dataset; valid_time for target datasets (e.g. reanalyses). rename_dimensions=‘ecmwf’ (default) assumes ECMWF standard names: {‘time’: ‘init_time’, ‘prediction_timedelta’: ‘lead_time’} for prediction datasets and {‘time’: ‘valid_time’} for analysis datasets.
automatically_convert_lat_lon_to_latitude_longitude – (Optional) Whether to automatically convert ‘lat’ and ‘lon’ dimensions to ‘latitude’ and ‘longitude’. Default: True.
rename_variables – (Optional) Dictionary of variables to rename.
preprocessing_fn – (Optional) A function that is applied to the dataset right after it is opened.
**kwargs – Keyword arguments to pass to base.DataLoader.
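
The ‘ecmwf’ shortcut for rename_dimensions can be pictured as a choice between the two rename mappings listed above. The sketch below only illustrates that choice with plain dictionaries; resolve_rename_dimensions and its is_prediction flag are hypothetical names and not part of weatherbenchX.

```python
from typing import Mapping, Union

# Rename mappings described for rename_dimensions='ecmwf'.
ECMWF_PREDICTION_RENAMES = {'time': 'init_time', 'prediction_timedelta': 'lead_time'}
ECMWF_TARGET_RENAMES = {'time': 'valid_time'}


def resolve_rename_dimensions(
    rename_dimensions: Union[Mapping[str, str], str, None],
    is_prediction: bool,
) -> dict:
    """Hypothetical helper: turns the argument into an explicit rename mapping."""
    if rename_dimensions is None:
        return {}  # Nothing to rename.
    if isinstance(rename_dimensions, str):
        if rename_dimensions != 'ecmwf':
            raise ValueError(f'Unknown rename shortcut: {rename_dimensions!r}')
        renames = ECMWF_PREDICTION_RENAMES if is_prediction else ECMWF_TARGET_RENAMES
        return dict(renames)
    return dict(rename_dimensions)  # An explicit mapping is used as given.
```

With this helper, resolve_rename_dimensions('ecmwf', is_prediction=False) yields the single rename of time to valid_time, while an explicit dictionary is passed through unchanged.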

class weatherbenchX.data_loaders.xarray_loaders.PredictionsFromXarray(path: str | None = None, ds: Dataset | None = None, variables: Iterable[str] | None = None, sel_kwargs: Mapping[str, Any] | None = None, rename_dimensions: Mapping[str, str] | str | None = 'ecmwf', automatically_convert_lat_lon_to_latitude_longitude: bool = True, rename_variables: Mapping[str, str] | None = None, preprocessing_fn: Callable[[Dataset], Dataset] | None = None, **kwargs)

Data loader for reading prediction datasets from Xarray.

Example

>>> init_times, lead_times
(array(['2020-01-01T00:00:00.000000000', '2020-01-01T12:00:00.000000000'],
       dtype='datetime64[ns]'), array([0, 6], dtype='timedelta64[h]'))
>>> variables = ['2m_temperature', '10m_wind_speed']
>>> prediction_data_loader = PredictionsFromXarray(
...     path=<PATH>,
...     variables=variables,
... )
>>> prediction_data_loader.load_chunk(init_times, lead_times)
<xarray.Dataset>
Dimensions:         (latitude: 32, longitude: 64, lead_time: 2, init_time: 2)
Coordinates:
  * latitude        (latitude) float64 -87.19 -81.56 -75.94 ... 81.56 87.19
  * longitude       (longitude) float64 0.0 5.625 11.25 ... 343.1 348.8 354.4
  * lead_time       (lead_time) timedelta64[ns] 00:00:00 06:00:00
  * init_time       (init_time) datetime64[ns] 2020-01-01 2020-01-01T12:00:00
Data variables:
    10m_wind_speed  (init_time, lead_time, longitude, latitude) float32 2.29 ...
    2m_temperature  (init_time, lead_time, longitude, latitude) float32 247.4...

Init.

Parameters:

path – (Optional) Path to xarray dataset to open. If it ends with ‘.zarr’, it is opened using xr.open_zarr. Otherwise, it is opened using xr.open_dataset.
ds – (Optional) Already opened xarray dataset. Either path or ds must be specified.
variables – (Optional) List of variables to load (after renaming). Default: Load all variables.
sel_kwargs – (Optional) Keyword arguments to pass to .sel() after renaming.
rename_dimensions – (Optional) Dictionary of dimensions to rename, or the shortcut ‘ecmwf’. The data loaders expect the following time dimensions: init_time and lead_time for a forecast dataset; valid_time for target datasets (e.g. reanalyses). rename_dimensions=‘ecmwf’ (default) assumes ECMWF standard names: {‘time’: ‘init_time’, ‘prediction_timedelta’: ‘lead_time’} for prediction datasets and {‘time’: ‘valid_time’} for analysis datasets.
automatically_convert_lat_lon_to_latitude_longitude – (Optional) Whether to automatically convert ‘lat’ and ‘lon’ dimensions to ‘latitude’ and ‘longitude’. Default: True.
rename_variables – (Optional) Dictionary of variables to rename.
preprocessing_fn – (Optional) A function that is applied to the dataset right after it is opened.
**kwargs – Keyword arguments to pass to base.DataLoader.

class weatherbenchX.data_loaders.xarray_loaders.TargetsFromXarray(path: str | None = None, ds: Dataset | None = None, variables: Iterable[str] | None = None, sel_kwargs: Mapping[str, Any] | None = None, rename_dimensions: Mapping[str, str] | str | None = 'ecmwf', automatically_convert_lat_lon_to_latitude_longitude: bool = True, rename_variables: Mapping[str, str] | None = None, preprocessing_fn: Callable[[Dataset], Dataset] | None = None, **kwargs)

Data loader for reading target datasets from Xarray.

Example

>>> init_times, lead_times
(array(['2020-01-01T00:00:00.000000000', '2020-01-01T12:00:00.000000000'],
       dtype='datetime64[ns]'), array([0, 6], dtype='timedelta64[h]'))
>>> variables = ['2m_temperature', '10m_wind_speed']
>>> target_data_loader = TargetsFromXarray(
...     path=<PATH>,
...     variables=variables,
... )
>>> target_data_loader.load_chunk(init_times, lead_times)
<xarray.Dataset>
Dimensions:         (latitude: 32, longitude: 64, init_time: 2, lead_time: 2)
Coordinates:
  * latitude        (latitude) float64 -87.19 -81.56 -75.94 ... 81.56 87.19
  * longitude       (longitude) float64 0.0 5.625 11.25 ... 343.1 348.8 354.4
    valid_time      (init_time, lead_time) datetime64[ns] 2020-01-01 ... 2020...
  * init_time       (init_time) datetime64[ns] 2020-01-01 2020-01-01T12:00:00
  * lead_time       (lead_time) timedelta64[ns] 00:00:00 06:00:00
Data variables:
    10m_wind_speed  (init_time, lead_time, longitude, latitude) float32 2.221...
    2m_temperature  (init_time, lead_time, longitude, latitude) float32 248.5...

Init.

Parameters:

path – (Optional) Path to xarray dataset to open. If it ends with ‘.zarr’, it is opened using xr.open_zarr. Otherwise, it is opened using xr.open_dataset.
ds – (Optional) Already opened xarray dataset. Either path or ds must be specified.
variables – (Optional) List of variables to load (after renaming). Default: Load all variables.
sel_kwargs – (Optional) Keyword arguments to pass to .sel() after renaming.
rename_dimensions – (Optional) Dictionary of dimensions to rename, or the shortcut ‘ecmwf’. The data loaders expect the following time dimensions: init_time and lead_time for a forecast dataset; valid_time for target datasets (e.g. reanalyses). rename_dimensions=‘ecmwf’ (default) assumes ECMWF standard names: {‘time’: ‘init_time’, ‘prediction_timedelta’: ‘lead_time’} for prediction datasets and {‘time’: ‘valid_time’} for analysis datasets.
automatically_convert_lat_lon_to_latitude_longitude – (Optional) Whether to automatically convert ‘lat’ and ‘lon’ dimensions to ‘latitude’ and ‘longitude’. Default: True.
rename_variables – (Optional) Dictionary of variables to rename.
preprocessing_fn – (Optional) A function that is applied to the dataset right after it is opened.
**kwargs – Keyword arguments to pass to base.DataLoader.

class weatherbenchX.data_loaders.xarray_loaders.ClimatologyFromXarray(climatology_time_coords: Iterable[str] = ('dayofyear', 'hour'), rename_dimensions: Mapping[str, str] | str | None = None, **kwargs)

Reads a climatology dataset as a predictions dataset.

Init.

Parameters:

climatology_time_coords – The time coordinates of the climatology dataset to select. Default: (‘dayofyear’, ‘hour’).
rename_dimensions – (Optional) Dictionary of dimensions to rename. Default: None.
**kwargs – Other arguments to pass to XarrayDataLoader.

class weatherbenchX.data_loaders.xarray_loaders.PersistenceFromXarray(path: str | None = None, ds: Dataset | None = None, variables: Iterable[str] | None = None, sel_kwargs: Mapping[str, Any] | None = None, rename_dimensions: Mapping[str, str] | str | None = 'ecmwf', automatically_convert_lat_lon_to_latitude_longitude: bool = True, rename_variables: Mapping[str, str] | None = None, preprocessing_fn: Callable[[Dataset], Dataset] | None = None, **kwargs)

Reads a target dataset as a prediction dataset by replicating data along lead times.

Init.

Parameters:

path – (Optional) Path to xarray dataset to open. If it ends with ‘.zarr’, it is opened using xr.open_zarr. Otherwise, it is opened using xr.open_dataset.
ds – (Optional) Already opened xarray dataset. Either path or ds must be specified.
variables – (Optional) List of variables to load (after renaming). Default: Load all variables.
sel_kwargs – (Optional) Keyword arguments to pass to .sel() after renaming.
rename_dimensions – (Optional) Dictionary of dimensions to rename, or the shortcut ‘ecmwf’. The data loaders expect the following time dimensions: init_time and lead_time for a forecast dataset; valid_time for target datasets (e.g. reanalyses). rename_dimensions=‘ecmwf’ (default) assumes ECMWF standard names: {‘time’: ‘init_time’, ‘prediction_timedelta’: ‘lead_time’} for prediction datasets and {‘time’: ‘valid_time’} for analysis datasets.
automatically_convert_lat_lon_to_latitude_longitude – (Optional) Whether to automatically convert ‘lat’ and ‘lon’ dimensions to ‘latitude’ and ‘longitude’. Default: True.
rename_variables – (Optional) Dictionary of variables to rename.
preprocessing_fn – (Optional) A function that is applied to the dataset right after it is opened.
**kwargs – Keyword arguments to pass to base.DataLoader.
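
Replicating along lead times means that every lead time of a forecast initialized at a given time carries the target value from that init time. A minimal sketch of the idea, with a plain dictionary standing in for the target dataset; persistence_forecast is a hypothetical name, the values are made up for illustration, and the actual loader operates on xarray objects:

```python
from datetime import datetime, timedelta


def persistence_forecast(targets, init_times, lead_times):
    """Maps every (init_time, lead_time) pair to the target value at init_time."""
    return {
        (init_time, lead_time): targets[init_time]
        for init_time in init_times
        for lead_time in lead_times
    }


# Illustrative target values keyed by time (not real data).
targets = {datetime(2020, 1, 1, 0): 247.4, datetime(2020, 1, 1, 12): 251.0}
forecast = persistence_forecast(
    targets,
    init_times=list(targets),
    lead_times=[timedelta(hours=0), timedelta(hours=6)],
)
```

Here the 6h forecast initialized at 00UTC holds the same value as the 00UTC target, which is what makes persistence a simple baseline.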

class weatherbenchX.data_loaders.xarray_loaders.ProbabilisticClimatologyFromXarray(start_year: int, end_year: int, ensemble_dim: str = 'number', **kwargs)

Reads a target dataset and treats every year as an ensemble member.

For each valid_time, the loader takes the value for the same day of the year and hour of the day from every year of the target dataset between start_year and end_year (inclusive) and treats each of these values as an ensemble member.

When the last day of a leap year (day 366) is queried, non-leap years contribute the first day of their following year instead.

This is used as a probabilistic baseline for the WeatherBench website.

Init.

Parameters:

start_year – The first year to include in the climatology.
end_year – The last year (incl.) to include in the climatology.
ensemble_dim – The dimension to use for the ensemble. Default: ‘number’.
**kwargs – Other arguments to pass to XarrayDataLoader.
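
The day-of-year matching and the leap-year rule described above can be sketched with the standard library. This is an illustration only: climatology_member_times is a hypothetical name, matching is assumed to use the ordinal day of the year, and standard-library datetimes are used in place of the datetime64 values the loader works with.

```python
from datetime import datetime, timedelta


def climatology_member_times(valid_time, start_year, end_year):
    """Returns one timestamp per year with the same day of year and hour."""
    # tm_yday is 1-based and equals 366 on Dec 31 of a leap year.
    day_of_year = valid_time.timetuple().tm_yday
    time_of_day = timedelta(hours=valid_time.hour)
    return [
        # In a non-leap year, day 366 rolls over to Jan 1 of the following year.
        datetime(year, 1, 1) + timedelta(days=day_of_year - 1) + time_of_day
        for year in range(start_year, end_year + 1)
    ]


members = climatology_member_times(datetime(2020, 12, 31, 6), 2018, 2020)
```

For this query, the 2018 and 2019 members fall on 1 January of 2019 and 2020 respectively, while the 2020 member is 31 December 2020 itself.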

Sparse Data Loaders

class weatherbenchX.data_loaders.sparse_parquet.SparseObservationsFromParquet(path: str, partitioned_by: str, time_dim: str, variables: Sequence[str], coordinate_variables: Sequence[str] = (), split_variables: bool = False, dropna: bool = False, tolerance: timedelta64 | tuple[timedelta64, timedelta64] | None = None, rename_variables: Mapping[str, str] | None = None, include_slice_end_time: bool = False, remove_duplicates: bool = False, pick_closest_duplicate_by: str | None = None, observation_dim: str | None = None, file_tolerance: timedelta64 = np.timedelta64(1, 'h'), preprocessing_fn: Callable[[DataFrame], DataFrame] | None = None, **kwargs)

Reads general sparse observation data stored in Parquet format.

It is assumed that the data is partitioned by month, day or hour. A daily partition would use the following directory structure: <PATH>/year=2020/month=1/day=1/2020-01-01.parquet

Since auto-discovery of files can take a long time, this data loader assumes this format to quickly query the desired sub-files for a given time interval.

Currently, this assumes there are no missing files.
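
Because the layout is fixed, the files needed for a time interval can be derived from the timestamps alone instead of being discovered on disk. The sketch below shows this for the daily layout given above; daily_partition_paths is a hypothetical helper, and the monthly and hourly layouts are not covered because their file naming is not spelled out here.

```python
from datetime import datetime, timedelta


def daily_partition_paths(path, start, end):
    """Lists the daily Parquet files covering the days from start to end (inclusive)."""
    paths = []
    day = start.date()
    while day <= end.date():
        # Month and day are not zero-padded in the directory names.
        paths.append(
            f'{path}/year={day.year}/month={day.month}/day={day.day}/'
            f'{day.isoformat()}.parquet'
        )
        day += timedelta(days=1)
    return paths


paths = daily_partition_paths(
    '<PATH>', datetime(2020, 1, 31, 18), datetime(2020, 2, 1, 6)
)
```

An interval that crosses midnight therefore maps to two files, one per day, and every such file is expected to exist.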

Init.

Parameters:

path – Path to Parquet dataset.
partitioned_by – How the Parquet file is partitioned. ‘hour’, ‘day’ or ‘month’.
time_dim – Time dimension on Parquet files (before renaming) to use for time filtering.
variables – Variables to load (after renaming).
coordinate_variables – Coordinate variables to load. These will be converted to xarray coordinates. ‘valid_time’ is always a coordinate and represents the original value of the time_dim coordinate. Default: ()
split_variables – Whether to return the loaded data as a dictionary of DataArrays. Default: False.
dropna – Whether to drop missing values. If split_variables is True, values will be dropped for each variable separately. Otherwise, only indices where all variables are non-NaN will be returned. Default: False.
tolerance – (Optional) Tolerance around the given valid time. If tolerance is a single timedelta, data within valid_time +/- tolerance will be returned. If tolerance is a 2-tuple of timedeltas, data within [valid_time + tolerance[0], valid_time + tolerance[1]] will be returned. This is only supported for exact lead_times. The resulting init and lead time coordinates will be those requested. The valid_time dimension will reflect the original time for each observation.
rename_variables – (Optional) Renaming dictionary.
include_slice_end_time – Whether slice end time is included. Default: False
remove_duplicates – For exact lead times, whether duplicate stations (specified by observation_dim) for the same valid time are removed. If True, this will pick the closest time specified by pick_closest_duplicate_by to the valid_time and keep it. Default: False
pick_closest_duplicate_by – (Optional) Time dimension to use to pick the closest duplicate.
observation_dim – (Optional) Dimension identifying e.g. station names. This is used to remove duplicate observations.
file_tolerance – ‘timeObs’ does not always align with the time on the partition. To make sure all required times are read, files within +/- file_tolerance are opened. The ‘timeObs’ values of most observations are within a one-hour window of the nominal time. ‘timeNominal’ will be equal to the partition time and would therefore not require a file_tolerance. Default: 1h
preprocessing_fn – (Optional) Function to apply to the dataframe after reading.
**kwargs – Additional keyword arguments passed to the base DataLoader.
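
The two forms of the tolerance argument translate into a time window around each valid time. A minimal sketch of that rule, using standard-library datetimes rather than datetime64 values; tolerance_window is a hypothetical name and not part of the loader:

```python
from datetime import datetime, timedelta


def tolerance_window(valid_time, tolerance):
    """Returns the (start, end) interval of observation times to keep."""
    if isinstance(tolerance, tuple):
        # Offsets are relative to valid_time; the lower one may be negative.
        lower, upper = tolerance
    else:
        # A single timedelta gives a symmetric window.
        lower, upper = -tolerance, tolerance
    return valid_time + lower, valid_time + upper


valid_time = datetime(2020, 1, 1, 12)
symmetric = tolerance_window(valid_time, timedelta(minutes=30))
one_sided = tolerance_window(valid_time, (timedelta(hours=-1), timedelta(0)))
```

The 2-tuple form is useful when only observations taken up to the valid time should count, as in the one-sided window above.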

Latency Wrappers

class weatherbenchX.data_loaders.latency_wrappers.ConstantLatencyWrapper(data_loader: DataLoader, latency: timedelta64, nominal_init_times: ndarray, concat_dim: str = 'init_time')

Wraps a data loader to adjust init/lead_times based on a constant latency.

Terminology used here:

Nominal init time: Initialization time of raw file(s), or in other words, what the underlying model considers t=0.
Nominal lead time: Lead time of raw file(s).
Latency: Delay from the nominal init time to the time when the forecast would be available in an operational setting.
Issue time: Init time at which the forecast is actually available = nominal init time + latency.
Queried init/lead time: Actual time/lead time a forecast is requested for in an operational setting.

This works by picking the most recently available nominal init time (i.e. init time on file) for the requested init_time given a constant latency.

Lead and init times are adjusted to reflect the latency and _load_chunk_from_source is called with these adjusted values for the given data loader. The returned values are then assigned the requested init/lead times.

Because this has to be done for each init time separately, the results are concatenated. For non-sparse data (i.e. where init_time is a data dimension), the concatenation is done along the init_time dimension. For sparse data, where init_time is simply a coordinate, the concatenation is done along the index dimension. The concatenation dimension has to be passed explicitly via the concat_dim argument.

Examples

1. For a latency of 6h with nominal init_times at 00/12UTC, querying an init time of 12UTC and a lead time of 6h will internally load the 00UTC init time with a lead_time of 18h (the 12h offset between the queried and the nominal init time plus the queried 6h), so that the valid time of 18UTC is preserved. The returned data will still have an init_time of 12UTC and a lead_time of 6h.

2. For a forecast initialized at 00/12UTC with a 5 hour latency (meaning issue times of 05/17UTC), querying an init time of 16UTC and a lead time of 1h will internally load the 00UTC nominal init time with a lead time of 17h (the 16h offset plus the queried 1h).
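
The selection of the nominal init time in both examples can be reproduced with a short sketch. It is not the wrapper's implementation: available_nominal_init_time is a hypothetical name, standard-library datetimes replace datetime64 values, and a forecast is assumed to be available as soon as its issue time is not later than the queried init time.

```python
from datetime import datetime, timedelta


def available_nominal_init_time(queried_init_time, nominal_init_times, latency):
    """Most recent nominal init time whose issue time is not after the query."""
    available = [
        t for t in nominal_init_times if t + latency <= queried_init_time
    ]
    if not available:
        raise ValueError('No forecast has been issued yet at the queried init time.')
    return max(available)


nominal_init_times = [datetime(2020, 1, 1, 0), datetime(2020, 1, 1, 12)]

# Example 1: 6h latency, queried init time 12UTC.
nominal_1 = available_nominal_init_time(
    datetime(2020, 1, 1, 12), nominal_init_times, timedelta(hours=6)
)
# Example 2: 5h latency, queried init time 16UTC.
nominal_2 = available_nominal_init_time(
    datetime(2020, 1, 1, 16), nominal_init_times, timedelta(hours=5)
)
```

In both cases the 12UTC forecast has not been issued yet at the queried time, so the 00UTC forecast is selected.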

Init.

Parameters:

data_loader – The data loader to wrap.
latency – Constant latency as np.timedelta64 object.
nominal_init_times – A numpy array containing the nominal init times of the predictions for the entire dataset of predictions as numpy datetime64. Example array: np.array(['2020-01-01T00', '2020-01-01T12', ...])
concat_dim – The dimension to concatenate along. Default: ‘init_time’. Set to ‘index’ for sparse data.

class weatherbenchX.data_loaders.latency_wrappers.XarrayConstantLatencyWrapper(data_loader: XarrayDataLoader, latency: timedelta64, init_time_dim: str = 'init_time', concat_dim: str = 'init_time')

Wraps an XarrayDataLoader in a latency wrapper.

This is a shortcut that uses the init_time coordinate on the Zarr file to determine the nominal init times.

Init.

Parameters:

data_loader – The data loader to wrap.
latency – Constant latency as np.timedelta64 object.
init_time_dim – Name of the init time coordinate on the wrapped loader’s dataset from which the nominal init times are read. Default: ‘init_time’.
concat_dim – The dimension to concatenate along. Default: ‘init_time’. Set to ‘index’ for sparse data.

class weatherbenchX.data_loaders.latency_wrappers.MultipleConstantLatencyWrapper(data_loaders: list[ConstantLatencyWrapper], concat_dim: str = 'init_time')

Extension to multiple data loaders with different nominal init times.

This serves the case where e.g. 00/12UTC and 06/18UTC forecasts are stored in different files (e.g. separate Zarr stores). This data loader then uses the most recent available init time across all data loaders.

It works internally by wrapping load_chunk, determining the wrapped data loader to call and then concatenating the results across init_time. If there is a tie (i.e. multiple underlying data loaders have the same available init time), it is broken by picking the data loader with the largest latency, on the assumption that a larger latency implies a larger lookahead.

As for the regular ConstantLatencyWrapper, the concatenation is done along the init_time dimension for non-sparse data and along the index dimension for sparse data. However, this has to be explicitly passed as an argument.

One difference to the regular ConstantLatencyWrapper is that the concatenation uses data from .load_chunk() of the wrapped loaders, whereas a single latency wrapper uses ._load_chunk_from_source(). This means that here the data is already interpolated.
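
The choice between wrapped loaders, including the tie-breaking rule, can be sketched as follows. This is a simplified illustration: pick_loader is a hypothetical name, each loader is reduced to a (name, latency, nominal_init_times) tuple, and a forecast is assumed to be available once its issue time is not later than the queried init time.

```python
from datetime import datetime, timedelta


def pick_loader(queried_init_time, loaders):
    """Picks the loader with the most recent available nominal init time.

    `loaders` is a list of (name, latency, nominal_init_times) tuples, a
    simplified stand-in for the wrapped data loaders.
    """
    candidates = []
    for name, latency, nominal_init_times in loaders:
        available = [t for t in nominal_init_times if t + latency <= queried_init_time]
        if available:
            candidates.append((max(available), latency, name))
    if not candidates:
        raise ValueError('No loader has a forecast available at the queried init time.')
    # Most recent init time wins; ties are broken by the largest latency.
    return max(candidates, key=lambda c: (c[0], c[1]))[2]


six_hours = timedelta(hours=6)
loaders = [
    ('00/12UTC', six_hours, [datetime(2020, 1, 1, 0), datetime(2020, 1, 1, 12)]),
    ('06/18UTC', six_hours, [datetime(2020, 1, 1, 6), datetime(2020, 1, 1, 18)]),
]
chosen = pick_loader(datetime(2020, 1, 1, 13), loaders)

tied_loaders = [
    ('short', timedelta(hours=3), [datetime(2020, 1, 1, 0)]),
    ('long', timedelta(hours=6), [datetime(2020, 1, 1, 0)]),
]
chosen_on_tie = pick_loader(datetime(2020, 1, 1, 8), tied_loaders)
```

At 13UTC the 06UTC forecast (issued at 12UTC) is the most recent one available, so the 06/18UTC loader is chosen; in the tied case both loaders offer the 00UTC init time and the one with the 6h latency is picked.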

Init.

Parameters:

data_loaders – List of ConstantLatencyWrapper instances to combine.
concat_dim – The dimension to concatenate along. Default: ‘init_time’. Set to ‘index’ for sparse data.