1. Create Training Data
1.1. ufs2arco Overview
EAGLE uses ufs2arco to generate training, validation, and test datasets.
The ufs2arco package loads raw meteorological data, transforms it, and
writes it in an Analysis Ready, Cloud Optimized (ARCO) Zarr format
suitable for machine learning workflows.
The workflow is built around three key components:
- Data sources: input datasets from systems such as NOAA GFS and HRRR, or other forecast and reanalysis archives
- Transforms: user-defined processing steps such as regridding and subsetting
- Targets: output data stored in Zarr format
  - base: a general format for scientific analysis with clear variables and dimensions
  - anemoi: a layout tailored for machine learning workflows, compatible with the anemoi framework
Overall, ufs2arco enables flexible, scalable, and fast preparation of large meteorological datasets for both research and machine learning workflows.
To begin, create a YAML recipe file named recipe.yaml. A simplified
example is shown below:
mover:
  name: mpidatamover
directories:
  zarr: hrrr.zarr
  cache: cache
  logs: logs
source:
  name: aws_hrrr_archive
  t0:
    start: 2022-01-01T06
    end: 2022-12-31T18
    freq: 6h
  fhr:
    start: 0
    end: 0
    step: 6
  variables:
    - gh
    - u
    - v
    - t
    - u10
    - v10
    - t2m
  levels:
    - 500
    - 850
target:
  name: anemoi
  sort_channels_by_levels: true
  compute_temporal_residual_statistics: true
  statistics_period:
    start: 2022-01-01T06
    end: 2022-12-31T18
  forcings:
    - cos_latitude
    - sin_latitude
    - cos_longitude
    - sin_longitude
  chunks:
    time: 1
    variable: -1
    ensemble: 1
    cell: -1
Next, run:
ufs2arco recipe.yaml
For more information, see the ufs2arco documentation.
ufs2arco was developed by Tim Smith at NOAA Physical Sciences Laboratory.
1.2. ufs2arco Quick Tips
1.2.1. Choosing Dates
Update the dates to include in your dataset by modifying the t0 block in
your recipe. These dates should cover all data that you plan to use for
training, validation, and testing. The full dataset can be split into those
subsets later.
t0:
  start: 2022-01-01T06
  end: 2022-12-31T18
  freq: 6h
Then ensure that the statistics_period block is also updated as needed:
statistics_period:
  start: 2022-01-01T06
  end: 2022-10-31T18
As a best practice, keep the statistics period limited to the dates used for the training dataset.
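The date bookkeeping described above can be checked with a short script before running the recipe. The following is a minimal sketch using only the Python standard library. It assumes that the t0 end time is inclusive, and the train/validation/test boundaries are hypothetical values chosen for illustration; they are not prescribed by EAGLE or ufs2arco.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H"  # matches recipe timestamps such as 2022-01-01T06


def initial_times(start, end, freq_hours):
    """Return every initial time from start to end (inclusive), freq_hours apart."""
    current = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    step = timedelta(hours=freq_hours)
    times = []
    while current <= stop:
        times.append(current)
        current += step
    return times


def in_period(times, start, end):
    """Select the times that fall inside the closed interval [start, end]."""
    lo = datetime.strptime(start, FMT)
    hi = datetime.strptime(end, FMT)
    return [t for t in times if lo <= t <= hi]


# All initial times covered by the t0 block shown above.
times = initial_times("2022-01-01T06", "2022-12-31T18", 6)

# Hypothetical chronological split (illustrative boundaries only).
train = in_period(times, "2022-01-01T06", "2022-10-31T18")
val = in_period(times, "2022-11-01T00", "2022-11-30T18")
test = in_period(times, "2022-12-01T00", "2022-12-31T18")

# The statistics period should not reach outside the training dates.
stats = in_period(times, "2022-01-01T06", "2022-10-31T18")
stats_ok = set(stats) <= set(train)

print(len(times), len(train), len(val), len(test), stats_ok)
# prints: 1459 1215 120 124 True
```

If stats_ok is False, the statistics period includes validation or test dates and should be shortened.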
1.2.2. Changing Variables and Levels
To change the variables or vertical levels in the dataset, add or remove entries
in the variables and levels lists of the source block in recipe.yaml. See the
ufs2arco documentation for the supported variables and configuration details.
1.2.3. Using a Different Dataset
EAGLE expects training and inference data to be available in a format compatible with the Anemoi data stack. There are two common ways to use a different dataset:
- Generate a new Zarr dataset with ufs2arco by updating the appropriate zarrs.<source>.zarr.ufs2arco config block. In most cases this means changing the source archive, date range, variables, levels, transforms, and target Zarr path.
- Point EAGLE at an existing compatible Zarr dataset by updating the training dataloader dataset path and the inference input dataset settings in the composed EAGLE config.
When changing datasets, make sure the variable names, vertical levels, grid, time frequency, forcing fields, and normalization/statistics settings are consistent with the model configuration and any checkpoint used for inference.
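One way to catch inconsistencies early is to compare a summary of the dataset settings against what the model configuration expects. The sketch below is illustrative only: the dictionaries, their keys, and the find_mismatches helper are hypothetical and do not reflect the actual EAGLE or Anemoi config schema. In practice the two summaries would be filled in from the recipe and the model configuration.

```python
def find_mismatches(dataset_settings, model_settings, keys):
    """Return {key: (dataset_value, model_value)} for every key whose values differ."""
    mismatches = {}
    for key in keys:
        ds_val = dataset_settings.get(key)
        model_val = model_settings.get(key)
        if ds_val != model_val:
            mismatches[key] = (ds_val, model_val)
    return mismatches


# Hypothetical summary of what the dataset contains.
dataset_settings = {
    "variables": ["gh", "u", "v", "t", "u10", "v10", "t2m"],
    "levels": [500, 850],
    "freq": "6h",
    "forcings": ["cos_latitude", "sin_latitude", "cos_longitude", "sin_longitude"],
}

# Hypothetical summary of what the model (or checkpoint) expects.
model_settings = {
    "variables": ["gh", "u", "v", "t", "u10", "v10", "t2m"],
    "levels": [500, 850, 1000],
    "freq": "6h",
    "forcings": ["cos_latitude", "sin_latitude", "cos_longitude", "sin_longitude"],
}

CHECKED_KEYS = ["variables", "levels", "freq", "forcings"]

mismatches = find_mismatches(dataset_settings, model_settings, CHECKED_KEYS)
for key, (ds_val, model_val) in mismatches.items():
    print(f"{key}: dataset has {ds_val}, model expects {model_val}")
# prints: levels: dataset has [500, 850], model expects [500, 850, 1000]
```

An empty result means the checked settings agree. Grid and normalization statistics are not covered by this sketch and need their own comparison.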
1.2.4. MPI Usage
ufs2arco can use MPI to parallelize data preprocessing; the example recipe
in the overview selects the MPI-based mover (name: mpidatamover). If you do
not want to use MPI, update the mover block as follows:
mover:
  name: datamover
  batch_size: 2
1.2.5. Using Different Datasets
You can use different datasets in EAGLE by changing the source section of
your ufs2arco recipe while keeping the same overall workflow.
Typical workflow:
1. Choose a supported source dataset (for example, GFS, HRRR, GEFS, AORC, or ERA5 reanalysis).
2. Update source settings (such as dates, variables, and levels).
3. Run ufs2arco recipe.yaml to generate the Zarr dataset.
4. Point your EAGLE configuration to that dataset for downstream pipeline steps.
A minimal example recipe is shown below:
mover:
  name: datamover
  batch_size: 2
directories:
  zarr: gefs.zarr
  cache: cache
  logs: logs
source:
  name: gefs_archive # example GEFS source
  t0:
    start: 2024-01-01T00
    end: 2024-01-10T00
    freq: 6h
  fhr:
    start: 0
    end: 0
    step: 6
  variables: [gh, u, v, t, q]
  levels: [500, 850]
target:
  name: anemoi
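Because only the source-related settings change between datasets, the recipe can be treated as a template. The following is a rough sketch of that idea using Python's standard string.Template. The render_recipe helper and the placeholder names are hypothetical and are not part of ufs2arco. The source name passed in must be one that ufs2arco actually supports.

```python
from string import Template

# Recipe skeleton mirroring the minimal example; only the $-placeholders vary.
RECIPE_TEMPLATE = Template("""\
mover:
  name: datamover
  batch_size: 2
directories:
  zarr: $zarr_path
  cache: cache
  logs: logs
source:
  name: $source_name
  t0:
    start: $start
    end: $end
    freq: 6h
  fhr:
    start: 0
    end: 0
    step: 6
  variables: [gh, u, v, t, q]
  levels: [500, 850]
target:
  name: anemoi
""")


def render_recipe(source_name, zarr_path, start, end):
    """Fill in the dataset-specific fields and return the recipe as YAML text."""
    return RECIPE_TEMPLATE.substitute(
        source_name=source_name,
        zarr_path=zarr_path,
        start=start,
        end=end,
    )


# Values taken from the minimal example; swap them to target another dataset.
recipe_text = render_recipe(
    source_name="gefs_archive",
    zarr_path="gefs.zarr",
    start="2024-01-01T00",
    end="2024-01-10T00",
)
print(recipe_text)
```

The rendered text can be saved as recipe.yaml and passed to ufs2arco recipe.yaml as in the workflow above.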
For source-specific options and examples, see the ufs2arco documentation.