1. Create Training Data

1.1. ufs2arco Overview

EAGLE uses ufs2arco to generate training, validation, and test datasets. The ufs2arco package preprocesses weather data and writes it in a Zarr format suitable for machine learning workflows. At a high level, the ufs2arco pipeline loads and transforms raw meteorological data into an Analysis Ready, Cloud Optimized (ARCO) Zarr format.

The workflow is built around three key components:

  • Data sources: input datasets from systems such as NOAA GFS and HRRR, or other forecast and reanalysis archives

  • Transforms: user-defined processing steps such as regridding and subsetting

  • Targets: output data stored in Zarr format

    • base: a general format for scientific analysis with clear variables and dimensions

    • anemoi: a layout tailored for machine learning workflows, compatible with the anemoi framework

Overall, ufs2arco enables flexible, scalable, and fast preparation of large meteorological datasets for both research and machine learning workflows.

To begin, create a YAML recipe file named recipe.yaml. A simplified example is shown below:

mover:
  name: mpidatamover

directories:
  zarr: hrrr.zarr
  cache: cache
  logs: logs

source:
  name: aws_hrrr_archive
  t0:
    start: 2022-01-01T06
    end: 2022-12-31T18
    freq: 6h

  fhr:
    start: 0
    end: 0
    step: 6

  variables:
    - gh
    - u
    - v
    - t
    - u10
    - v10
    - t2m

  levels:
    - 500
    - 850

target:
  name: anemoi
  sort_channels_by_levels: true
  compute_temporal_residual_statistics: true
  statistics_period:
    start: 2022-01-01T06
    end: 2022-12-31T18
  forcings:
    - cos_latitude
    - sin_latitude
    - cos_longitude
    - sin_longitude

chunks:
  time: 1
  variable: -1
  ensemble: 1
  cell: -1

Next, run:

ufs2arco recipe.yaml

For more information, see the ufs2arco documentation.

ufs2arco was developed by Tim Smith at NOAA Physical Sciences Laboratory.

1.2. ufs2arco Quick Tips

1.2.1. Choosing Dates

To change the dates covered by your dataset, modify the t0 block in the source section of your recipe. These dates should cover all data that you plan to use for training, validation, and testing; the full dataset can be split into those subsets later.

t0:
  start: 2022-01-01T06
  end: 2022-12-31T18
  freq: 6h

Then update the statistics_period block in the target section so that it falls within the new date range:

statistics_period:
  start: 2022-01-01T06
  end: 2022-10-31T18

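As a best practice, limit the statistics period to the dates used for the training dataset, so that the normalization statistics are not influenced by validation or test data. In the example above, the statistics period ends on 2022-10-31T18 while t0 runs through 2022-12-31T18, which leaves the final two months outside the statistics.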
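
The relationship between the t0 block and the statistics period can be checked with a few lines of Python before running ufs2arco. The sketch below is illustrative only: it uses the standard library, does not call ufs2arco, and the helper name initial_times and the choice of 2022-10-31T18 as the end of training are assumptions taken from the example values above. It lists every 6-hourly initial time in the t0 range and separates the times that fall inside the statistics period from those held out.

```python
from datetime import datetime, timedelta


def initial_times(start, end, freq_hours):
    """Return every initial time from start to end (inclusive) at a fixed hourly step.

    Hypothetical helper for illustration; it is not part of ufs2arco.
    """
    fmt = "%Y-%m-%dT%H"
    current = datetime.strptime(start, fmt)
    stop = datetime.strptime(end, fmt)
    times = []
    while current <= stop:
        times.append(current)
        current += timedelta(hours=freq_hours)
    return times


# Values copied from the example t0 block (freq: 6h).
all_times = initial_times("2022-01-01T06", "2022-12-31T18", 6)

# Assumed end of the training period, matching the example statistics_period.
train_end = datetime(2022, 10, 31, 18)

train_times = [t for t in all_times if t <= train_end]
held_out_times = [t for t in all_times if t > train_end]

print("total initial times:   ", len(all_times))
print("inside statistics:     ", len(train_times))
print("held out of statistics:", len(held_out_times))
```

Counting the samples this way makes it easy to confirm that the validation and test ranges are non-empty and that none of their dates overlap the statistics period.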

1.2.2. Changing Variables and Levels

To change the variables or vertical levels in the dataset, edit the variables and levels lists in the source block of recipe.yaml. See the ufs2arco documentation for the supported variables and configuration details.

1.2.3. Using a Different Dataset

EAGLE expects training and inference data to be available in a format compatible with the Anemoi data stack. There are two common ways to use a different dataset:

  • Generate a new Zarr dataset with ufs2arco by updating the appropriate zarrs.<source>.zarr.ufs2arco config block. In most cases this means changing the source archive, date range, variables, levels, transforms, and target Zarr path.

  • Point EAGLE at an existing compatible Zarr dataset by updating the training dataloader dataset path and the inference input dataset settings in the composed EAGLE config.

When changing datasets, make sure the variable names, vertical levels, grid, time frequency, forcing fields, and normalization/statistics settings are consistent with the model configuration and any checkpoint used for inference.
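
One way to catch inconsistencies early is to compare a short summary of the dataset against what the model expects. The following is a minimal sketch under stated assumptions: the two dictionaries, their keys, the "grid" label, and the function find_mismatches are all hypothetical and do not correspond to any EAGLE, Anemoi, or ufs2arco API. The dataset values mirror the example recipe; the model values are invented to show a mismatch.

```python
def find_mismatches(dataset, model):
    """Return descriptions of settings the dataset cannot satisfy.

    Hypothetical helper for illustration; both arguments are plain dicts.
    """
    problems = []
    # List-valued settings: everything the model expects must exist in the dataset.
    for key in ("variables", "levels", "forcings"):
        missing = [item for item in model[key] if item not in dataset[key]]
        if missing:
            problems.append(f"{key}: dataset is missing {missing}")
    # Scalar settings: the values must match exactly.
    for key in ("freq", "grid"):
        if dataset[key] != model[key]:
            problems.append(
                f"{key}: dataset has {dataset[key]!r}, model expects {model[key]!r}"
            )
    return problems


# Hypothetical summary of the dataset produced by the example recipe.
dataset = {
    "variables": ["gh", "u", "v", "t", "u10", "v10", "t2m"],
    "levels": [500, 850],
    "forcings": ["cos_latitude", "sin_latitude", "cos_longitude", "sin_longitude"],
    "freq": "6h",
    "grid": "hrrr",  # assumed label for the grid
}

# Hypothetical model expectations; "q" is deliberately absent from the dataset.
model = {
    "variables": ["gh", "u", "v", "t", "q"],
    "levels": [500, 850],
    "forcings": ["cos_latitude", "sin_latitude"],
    "freq": "6h",
    "grid": "hrrr",
}

for problem in find_mismatches(dataset, model):
    print(problem)
```

The check is deliberately one-directional: extra variables in the dataset are not reported, only items the model needs that the dataset lacks.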

1.2.4. MPI Usage

ufs2arco can use MPI to parallelize data preprocessing; the example recipe above does this by setting the mover name to mpidatamover. If you do not want to use MPI, update the mover block as follows:

mover:
  name: datamover
  batch_size: 2

1.2.5. Using Different Datasets

You can use different datasets in EAGLE by changing the source section of your ufs2arco recipe while keeping the same overall workflow.

Typical workflow:

  1. Choose a supported source dataset (for example, GFS, HRRR, GEFS, AORC, or ERA5 reanalysis).

  2. Update source settings (such as dates, variables, and levels).

  3. Run ufs2arco recipe.yaml to generate the Zarr dataset.

  4. Point your EAGLE configuration to that dataset for downstream pipeline steps.

A minimal example recipe is shown below:

mover:
  name: datamover
  batch_size: 2

directories:
  zarr: gefs.zarr
  cache: cache
  logs: logs

source:
  name: gefs_archive  # illustrative GEFS source name; confirm the exact name in the ufs2arco documentation
  t0:
    start: 2024-01-01T00
    end: 2024-01-10T00
    freq: 6h
  fhr:
    start: 0
    end: 0
    step: 6
  variables: [gh, u, v, t, q]
  levels: [500, 850]

target:
  name: anemoi

For source-specific options and examples, see: