
📚[DOC]: Request for Dataset Download Source and Formatting Instructions for StormCast

Open tang0214 opened this issue 9 months ago • 4 comments

How would you describe the priority of this documentation request

Critical (currently preventing usage)

Is this for new documentation, or an update to existing docs?

Update

Describe the incorrect/future/missing documentation

Modulus Version

1.0.0

Environment

OS: Ubuntu 18.04 CPU: Intel Xeon E5-2650 GPU: NVIDIA Quadro RTX 6000/8000 (Driver: NVIDIA 535.146.02)

Issue Description

Hi team,

First of all, thank you for providing this amazing library!

We are currently following one of the examples from this repository, specifically the StormCast example at https://github.com/NVIDIA/physicsnemo/tree/main/examples/generative/stormcast. We would like to reproduce the StormCast results (in particular, train.py), but we could not find detailed dataset preparation instructions.

We understand how the datasets should be arranged, as specified in https://github.com/NVIDIA/physicsnemo/blob/main/examples/generative/stormcast/utils/data_loader_hrrr_era5.py:

class HrrrEra5Dataset(Dataset):
    """
    Paired dataset object serving time-synchronized pairs of ERA5 and HRRR samples
    Expects data to be stored under directory specified by 'location' with the
    following layout:

    | <location>
    | -- era5
         | -- stats
              | -- means.npy
              | -- stds.npy
              | -- time_means.npy
         | -- valid
         | -- test
         | -- train
    | -- <params.conus_dataset_name>
         | -- invariants.zarr
         | -- <params.hrrr_stats>
              | -- means.npy
              | -- stds.npy
         | -- valid
         | -- test
         | -- train

    ERA5 stored under <location>/era5/
    HRRR stored under <location>/<params.conus_dataset_name>

    Within each train/valid/test directory, there should be one zarr file per
    year containing the data of interest.
    """
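For reference, the expected directory skeleton can be scaffolded and sanity-checked with a short script. This is only a sketch: the concrete names for `<params.conus_dataset_name>` and `<params.hrrr_stats>` below (`hrrr`, `stats`) are placeholder assumptions, and the real values come from the StormCast config.

```python
import os
import tempfile

# Placeholder values for the configurable parts of the layout; the real
# names come from the StormCast config (params.conus_dataset_name, params.hrrr_stats).
CONUS_DATASET_NAME = "hrrr"  # assumption
HRRR_STATS = "stats"         # assumption

REQUIRED_DIRS = [
    "era5/stats", "era5/train", "era5/valid", "era5/test",
    f"{CONUS_DATASET_NAME}/{HRRR_STATS}",
    f"{CONUS_DATASET_NAME}/train",
    f"{CONUS_DATASET_NAME}/valid",
    f"{CONUS_DATASET_NAME}/test",
]

def scaffold_layout(location: str) -> None:
    """Create the empty directory skeleton the docstring above describes."""
    for sub in REQUIRED_DIRS:
        os.makedirs(os.path.join(location, sub), exist_ok=True)

def check_layout(location: str) -> list:
    """Return the list of expected subdirectories missing under `location`."""
    return [p for p in REQUIRED_DIRS if not os.path.isdir(os.path.join(location, p))]

# Usage: scaffold into a temporary directory and verify nothing is missing
with tempfile.TemporaryDirectory() as loc:
    scaffold_layout(loc)
    assert check_layout(loc) == []
```

The per-year zarr files and the `means.npy`/`stds.npy`/`time_means.npy` stats files still need to be produced separately and placed into this skeleton.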

The dataset is described at a high level in README.md:

Dataset
In this example, StormCast is trained on the HRRR dataset, conditioned on the ERA5 dataset. The datapipe in this example is tailored specifically for the domain and problem setting posed in the original StormCast preprint, namely a subset of HRRR and ERA5 variables in a region over the Central US with spatial extent 1536km x 1920km.

However, we still do not know:

  1. How do we download the ERA5 and HRRR datasets and arrange them into the structure described in the data loader?
  2. How do we transform them into the exact data format used by StormCast?

It would be very helpful for us if more detailed instructions were provided for those questions.

Thanks in advance!

Best regards, tang0214


tang0214 avatar Mar 22 '25 03:03 tang0214

Hello,

Thanks for raising this issue. It is still on our roadmap to provide more complete dataset preparation instructions for this example, but at the moment I can't give you a specific timeline on that.

That said, in the meantime I wanted to at least point you to earth2studio, which has data sources for HRRR and ERA5 defined. You should be able to write a script to download variables and prepare the data accordingly using earth2studio.data utilities. You can also check out the StormCast example there for a reference inference workflow.
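As a rough illustration of the "prepare the data" step, the per-channel normalization files in the layout above (`means.npy`, `stds.npy`) can be computed from whatever arrays you download. This is a sketch only: the `(time, channel, y, x)` array layout and the stats shapes are assumptions, not the pipeline's documented convention, so check the StormCast data loader before relying on them.

```python
import numpy as np

def compute_channel_stats(fields: np.ndarray):
    """Per-channel mean/std over time and space.

    Assumes `fields` is shaped (time, channel, y, x); this layout is an
    assumption for the sketch, not a documented StormCast convention.
    """
    means = fields.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, channel, 1, 1)
    stds = fields.std(axis=(0, 2, 3), keepdims=True)
    return means, stds

# Fake data standing in for downloaded ERA5/HRRR variables
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=(8, 4, 16, 16))

means, stds = compute_channel_stats(data)
np.save("means.npy", means)  # would go under <location>/era5/stats/ or the HRRR stats dir
np.save("stds.npy", stds)
```

Normalizing with these files, `(data - means) / stds` should then have roughly zero mean and unit variance per channel.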

pzharrington avatar May 02 '25 18:05 pzharrington

Hi, thank you for your reply.

We’ve been working on this issue for two months, but we’re still stuck on running the StormCast training example. The main difficulty lies in preprocessing the raw ERA5 and HRRR datasets downloaded via Earth2Studio, as the current documentation doesn’t clearly explain the required steps.

It would be greatly appreciated if more complete and detailed instructions for dataset preparation could be provided for StormCast.

Thanks again for your support!

tang0214 avatar May 18 '25 17:05 tang0214

Hi @pzharrington, I'm working with @tang0214 on the same issue. Thanks a lot for the recent PR #880; we saw the team has made progress, and it really makes a difference.

In the new README,

Adding custom datasets
While it is possible to train StormCast on custom datasets by formatting them identically to the Zarr datasets used in the ERA5-HRRR example, a more flexible option is to define a custom dataset object. These datasets must follow the StormCastDataset interface defined in datasets/base.py; see the docstrings in that file for a specification of what the functions must accept and return. You can use the datasets/data_loader_hrrr_era5.py implementation as an example.

is "datasets/base.py" the same as "datasets/__init__.py"?

running-berry avatar May 20 '25 09:05 running-berry

Hi @running-berry, thanks for highlighting this issue. The README has a typo; the correct file that specifies the base dataset class is actually datasets/dataset.py. We will update this. @jleinonen for viz
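To illustrate the custom-dataset idea from the README passage above, a minimal paired dataset might look roughly like the sketch below. This is hypothetical: the method names and the (background, target) pairing are assumptions made for illustration, and the real interface specification lives in the base dataset class file in the repo.

```python
import numpy as np

class MyCustomDataset:
    """Hypothetical sketch of a paired background/target dataset.

    The real StormCast interface is defined in the repo's base dataset
    class; the method names and return conventions here are assumptions
    for illustration only.
    """

    def __init__(self, n_samples: int = 10, shape=(4, 32, 32)):
        rng = np.random.default_rng(42)
        # Stand-ins for coarse (ERA5-like) and fine (HRRR-like) fields
        self.background = rng.normal(size=(n_samples, *shape)).astype(np.float32)
        self.target = rng.normal(size=(n_samples, *shape)).astype(np.float32)

    def __len__(self):
        return len(self.background)

    def __getitem__(self, idx):
        # Return a time-synchronized (background, target) pair
        return self.background[idx], self.target[idx]

ds = MyCustomDataset()
bg, tg = ds[0]
```

A real implementation would load the fields from disk (e.g. per-year zarr files) and apply the normalization stats instead of generating random arrays.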

pzharrington avatar May 20 '25 18:05 pzharrington