.. _CreateTrainingData:

==============================================================================
Create Training Data
==============================================================================

.. _Ufs2ArcoOverview:

ufs2arco Overview
------------------------------------------------------------------------------

:term:`EAGLE` uses :term:`ufs2arco` to generate training, validation, and test datasets.
The ``ufs2arco`` package loads raw meteorological data, applies preprocessing
transforms, and writes the result as an Analysis Ready, Cloud Optimized (ARCO)
:term:`Zarr` store suitable for machine learning workflows.

The workflow is built around three key components:

* Data sources: input datasets from systems such as NOAA :term:`GFS` and
  :term:`HRRR`, or other forecast and reanalysis archives
* Transforms: user-defined processing steps such as regridding and subsetting
* Targets: output data stored in Zarr format

  * ``base``: a general format for scientific analysis with clear variables
    and dimensions
  * ``anemoi``: a layout tailored for machine learning workflows, compatible
    with the anemoi framework

Overall, ufs2arco enables flexible, scalable, and fast preparation of large
meteorological datasets for both research and machine learning workflows.

To begin, create a :term:`YAML` recipe file named ``recipe.yaml``. A simplified
example is shown below:

.. code-block:: yaml

    mover:
      name: mpidatamover

    directories:
      zarr: hrrr.zarr
      cache: cache
      logs: logs

    source:
      name: aws_hrrr_archive
      t0:
        start: 2022-01-01T06
        end: 2022-12-31T18
        freq: 6h

      fhr:
        start: 0
        end: 0
        step: 6

      variables:
        - gh
        - u
        - v
        - t
        - u10
        - v10
        - t2m

      levels:
        - 500
        - 850

    target:
      name: anemoi
      sort_channels_by_levels: true
      compute_temporal_residual_statistics: true
      statistics_period:
        start: 2022-01-01T06
        end: 2022-12-31T18
      forcings:
        - cos_latitude
        - sin_latitude
        - cos_longitude
        - sin_longitude

    chunks:
      time: 1
      variable: -1
      ensemble: 1
      cell: -1

Next, run ``ufs2arco`` on the recipe to generate the dataset:

.. code-block:: bash

    ufs2arco recipe.yaml

For more information, see the `ufs2arco documentation <https://ufs2arco.readthedocs.io/en/latest/>`_.

``ufs2arco`` was developed by Tim Smith at the NOAA Physical Sciences Laboratory.

.. _Ufs2ArcoTips:

ufs2arco Quick Tips
------------------------------------------------------------------------------

.. _ChooseDates:

Choosing Dates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To change which dates the dataset includes, modify the ``t0`` block in your
recipe. The date range should cover all data that you plan to use for
training, validation, and testing. The full dataset can be split into those
subsets later.

.. code-block:: yaml

    t0:
      start: 2022-01-01T06
      end: 2022-12-31T18
      freq: 6h

Then update the ``statistics_period`` block under ``target`` so that it falls
within the new date range:

.. code-block:: yaml

    statistics_period:
      start: 2022-01-01T06
      end: 2022-10-31T18

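As a best practice, limit the statistics period to the dates used for the
training dataset, so that the statistics are not influenced by validation or
test data. In the example above, the period ends on 2022-10-31, which leaves
November and December available for validation and testing.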
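
The relationship between the ``t0`` block and the ``statistics_period`` block
can be checked before running ``ufs2arco``. The following is a minimal sketch
that uses only the Python standard library. The helper names
(``initial_times``, ``period_is_covered``) are hypothetical and are not part of
``ufs2arco``, and the frequency is passed as a number of hours rather than
parsed from the ``6h`` string in the recipe.

.. code-block:: python

    from datetime import datetime, timedelta

    FMT = "%Y-%m-%dT%H"  # matches recipe timestamps such as 2022-01-01T06


    def initial_times(start, end, freq_hours):
        """Return every timestamp from start to end (inclusive) at a fixed step."""
        current = datetime.strptime(start, FMT)
        stop = datetime.strptime(end, FMT)
        times = []
        while current <= stop:
            times.append(current)
            current += timedelta(hours=freq_hours)
        return times


    def period_is_covered(period, times):
        """Check that both ends of a period fall on dataset timestamps."""
        return (
            datetime.strptime(period["start"], FMT) in times
            and datetime.strptime(period["end"], FMT) in times
        )


    # Values copied from the example recipe blocks (hypothetical dict layout)
    t0 = {"start": "2022-01-01T06", "end": "2022-12-31T18", "freq_hours": 6}
    statistics_period = {"start": "2022-01-01T06", "end": "2022-10-31T18"}

    all_times = initial_times(t0["start"], t0["end"], t0["freq_hours"])
    print(len(all_times))  # number of initial times in the dataset
    print(period_is_covered(statistics_period, all_times))

If the check returns ``False``, the statistics period starts or ends outside
the dates that the recipe requests, or does not line up with the ``freq`` step.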

.. _ChangeVariables:

Changing Variables and Levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To change the variables or vertical levels in the dataset, edit the
``variables`` and ``levels`` lists in the ``source`` block of ``recipe.yaml``.
See the `ufs2arco documentation <https://ufs2arco.readthedocs.io/en/latest/>`_
for the supported variables and configuration details.

Using a Different Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

EAGLE expects training and inference data to be available in a format compatible
with the Anemoi data stack. There are two common ways to use a different
dataset:

* Generate a new Zarr dataset with ``ufs2arco`` by updating the appropriate
  ``zarrs.<source>.zarr.ufs2arco`` config block. In most cases this means
  changing the source archive, date range, variables, levels, transforms, and
  target Zarr path.
* Point EAGLE at an existing compatible Zarr dataset by updating the training
  dataloader dataset path and the inference input dataset settings in the
  composed EAGLE config.

When changing datasets, make sure the variable names, vertical levels, grid,
time frequency, forcing fields, and normalization/statistics settings are
consistent with the model configuration and any checkpoint used for inference.
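
One way to catch mismatches early is to compare the channels that a recipe
would produce with the channels that the model expects. The sketch below is
illustrative only. It assumes that variables on vertical levels are flattened
into names of the form ``<variable>_<level>`` and that forcings appear as
additional channels. The function names and the ``model_channels`` list are
hypothetical, so verify the actual channel names in your dataset and
configuration.

.. code-block:: python

    def channel_names(level_variables, levels, surface_variables, forcings=()):
        """Expand recipe entries into flat channel names.

        Assumes (unverified) that level variables become ``<variable>_<level>``.
        """
        on_levels = [f"{var}_{level}" for var in level_variables for level in levels]
        return on_levels + list(surface_variables) + list(forcings)


    def compare_channels(dataset_channels, model_channels):
        """Report channels that appear on one side only."""
        dataset, model = set(dataset_channels), set(model_channels)
        return {
            "missing_from_dataset": sorted(model - dataset),
            "not_used_by_model": sorted(dataset - model),
        }


    # Entries taken from the example HRRR recipe
    dataset_channels = channel_names(
        level_variables=["gh", "u", "v", "t"],
        levels=[500, 850],
        surface_variables=["u10", "v10", "t2m"],
        forcings=["cos_latitude", "sin_latitude", "cos_longitude", "sin_longitude"],
    )

    # Hypothetical model expectation that also requires q at level 850
    model_channels = dataset_channels + ["q_850"]

    print(compare_channels(dataset_channels, model_channels))

An empty ``missing_from_dataset`` list only shows that the names agree. Grid,
time frequency, and normalization statistics still need to be checked
separately.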

.. _MPIUsage:

MPI Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``ufs2arco`` can use :term:`MPI` to parallelize data preprocessing. If you do
not want to use MPI, update the ``mover`` block as follows:

.. code-block:: yaml

    mover:
      name: datamover
      batch_size: 2
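
If you keep the ``mpidatamover`` instead, the ``ufs2arco`` command is normally
started through an MPI launcher. The sketch below builds the command line for
both cases. It assumes a launcher named ``mpirun`` that accepts ``-n`` for the
number of ranks. The launcher name, its flags, and the rank count depend on
your system (for example, a scheduler-specific launcher on an HPC cluster), and
the ``ufs2arco_command`` helper is hypothetical.

.. code-block:: python

    import shlex


    def ufs2arco_command(recipe, ranks=None, launcher="mpirun"):
        """Build the argument list for a serial (ranks=None) or MPI run."""
        base = ["ufs2arco", recipe]
        if ranks is None:
            return base
        # Assumption: the launcher takes "-n <ranks>" before the program name
        return [launcher, "-n", str(ranks)] + base


    print(shlex.join(ufs2arco_command("recipe.yaml")))
    print(shlex.join(ufs2arco_command("recipe.yaml", ranks=4)))

The resulting list can be passed to ``subprocess.run`` or copied into a batch
script.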

.. _DifferentDatasets:

Changing the Source Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use different datasets in EAGLE by changing the ``source`` section of
your ufs2arco recipe while keeping the same overall workflow.

Typical workflow:

#. Choose a supported source dataset (for example, GFS, HRRR, GEFS, AORC, or
   ERA5 reanalysis).
#. Update ``source`` settings (such as dates, variables, and levels).
#. Run ``ufs2arco recipe.yaml`` to generate the Zarr dataset.
#. Point your EAGLE configuration to that dataset for downstream pipeline steps.

A minimal example recipe is shown below:

.. code-block:: yaml

    mover:
      name: datamover
      batch_size: 2

    directories:
      zarr: gefs.zarr
      cache: cache
      logs: logs

    source:
      name: gefs_archive  # illustrative name; confirm the exact GEFS source name in the ufs2arco documentation
      t0:
        start: 2024-01-01T00
        end: 2024-01-10T00
        freq: 6h
      fhr:
        start: 0
        end: 0
        step: 6
      variables: [gh, u, v, t, q]
      levels: [500, 850]

    target:
      name: anemoi

For source-specific options and examples, see the
`ufs2arco documentation <https://ufs2arco.readthedocs.io/en/latest/>`_.