📚[DOC]: Request for Dataset Download Source and Formatting Instructions for StormCast
How would you describe the priority of this documentation request?
Critical (currently preventing usage)
Is this for new documentation, or an update to existing docs?
Update
Describe the incorrect/future/missing documentation
Modulus Version
1.0.0
Environment
OS: Ubuntu 18.04
CPU: Intel Xeon E5-2650
GPU: NVIDIA Quadro RTX 6000/8000 (Driver: NVIDIA 535.146.02)
Issue Description
Hi team,
First of all, thank you for providing this amazing library!
We are currently following one of the examples from this repository, specifically the stormcast example in https://github.com/NVIDIA/physicsnemo/tree/main/examples/generative/stormcast. We would like to reproduce the StormCast results (particularly train.py), but we couldn't find detailed dataset installation instructions.
We know how the datasets are expected to be laid out, as specified in https://github.com/NVIDIA/physicsnemo/blob/main/examples/generative/stormcast/utils/data_loader_hrrr_era5.py:
class HrrrEra5Dataset(Dataset):
    """
    Paired dataset object serving time-synchronized pairs of ERA5 and HRRR samples
    Expects data to be stored under directory specified by 'location' with the
    following layout:
    |   <location>
    |   -- era5
    |      -- stats
    |         -- means.npy
    |         -- stds.npy
    |         -- time_means.npy
    |      -- valid
    |      -- test
    |      -- train
    |   -- <params.conus_dataset_name>
    |      invariants.zarr
    |      -- <params.hrrr_stats>
    |         -- means.npy
    |         -- stds.npy
    |      -- valid
    |      -- test
    |      -- train
    ERA5 stored under <location>/era5/
    HRRR stored under <location>/<params.conus_dataset_name>
    Within each train/valid/test directory, there should be one zarr file per
    year containing the data of interest.
    """
The README.md does point to the dataset websites and describes the data as follows:
Dataset
In this example, StormCast is trained on the HRRR dataset,
conditioned on the ERA5 dataset.
The datapipe in this example is tailored specifically
for the domain and problem setting posed in the original
StormCast preprint, namely a subset of HRRR and ERA5 variables
in a region over the Central US with spatial extent 1536km x 1920km.
We do not know:
- How to download the ERA5 and HRRR datasets so that they match the directory structure described in the data loader?
- How to transform them into the exact data format used by StormCast?
It would be very helpful for us if more detailed instructions were provided for those questions.
Thanks in advance!
Best regards, tang0214
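For readers preparing data by hand, it can help to check a directory against the layout quoted from the `HrrrEra5Dataset` docstring before starting training. The following is a minimal sketch using only the Python standard library. `check_layout` is a hypothetical helper, not part of PhysicsNeMo, and its `conus_dataset_name` and `hrrr_stats` arguments stand in for the `params` values named in the docstring. It checks only that the paths exist, not the contents or format of the files.

```python
from pathlib import Path


def check_layout(location, conus_dataset_name, hrrr_stats):
    """Return the expected paths (relative to `location`) that are missing.

    Hypothetical helper: the expected entries are transcribed from the
    HrrrEra5Dataset docstring, not read from the data loader's code.
    """
    root = Path(location)
    era5 = root / "era5"
    hrrr = root / conus_dataset_name

    expected = [
        era5 / "stats" / "means.npy",
        era5 / "stats" / "stds.npy",
        era5 / "stats" / "time_means.npy",
        hrrr / "invariants.zarr",
        hrrr / hrrr_stats / "means.npy",
        hrrr / hrrr_stats / "stds.npy",
    ]
    # Each dataset needs train/valid/test directories holding the yearly zarr files.
    for split in ("train", "valid", "test"):
        expected.append(era5 / split)
        expected.append(hrrr / split)

    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]
```

Calling `check_layout("/path/to/data", "hrrr", "stats")` (the path and names here are placeholders) returns an empty list when every expected entry is present.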
Hello,
Thanks for raising this issue. It is still on our roadmap to provide more complete dataset preparation instructions for this example, but at the moment I can't give you a specific timeline on that.
That said, in the meantime I wanted to at least point you to earth2studio, which has data sources for HRRR and ERA5 defined. You should be able to write a script to download variables and prepare the data accordingly using earth2studio.data utilities. You can also check out the StormCast example there for a reference inference workflow.
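As a starting point for such a script, here is a minimal sketch. It assumes earth2studio's `ARCO` (ERA5) and `HRRR` data sources, which are called as `source(time, variable)` and return an `xarray.DataArray`. The variable names, the date range, the output paths, the `data` array name, and the helper names are illustrative assumptions. The sketch does not reproduce the regional cropping, regridding, channel selection, or normalization statistics that the StormCast data loader expects, since those steps are not specified in this thread.

```python
from datetime import datetime, timedelta


def hourly_times(start, end, step_hours=1):
    """List datetimes from `start` (inclusive) to `end` (exclusive)."""
    times = []
    t = start
    while t < end:
        times.append(t)
        t += timedelta(hours=step_hours)
    return times


def download_to_zarr(source, times, variables, out_path):
    """Fetch `variables` at `times` from a data source and write them to Zarr.

    `source` is any callable with the earth2studio data-source signature
    source(time, variable) -> xarray.DataArray.
    """
    da = source(times, variables)
    # The array name "data" is an arbitrary choice for this sketch.
    da.to_dataset(name="data").to_zarr(out_path, mode="w")


def main():
    # Imported here so the helpers above can be used without earth2studio installed.
    from earth2studio.data import ARCO, HRRR

    # One day of hourly samples; a real run would cover whole years.
    times = hourly_times(datetime(2021, 1, 1), datetime(2021, 1, 2))
    variables = ["t2m", "u10m", "v10m"]  # example names, not the StormCast channel list
    download_to_zarr(ARCO(), times, variables, "era5/train/2021.zarr")
    download_to_zarr(HRRR(), times, variables, "hrrr/train/2021.zarr")
```

Running `main()` requires network access and the earth2studio package. The raw output still needs further processing before it matches the layout the data loader expects.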
Hi, thank you for your reply.
We’ve been working on this issue for two months, but we’re still stuck on running the StormCast training example. The main difficulty lies in preprocessing the raw ERA5 and HRRR datasets downloaded from Earth2Studio, as the current documentation doesn’t clearly explain the required steps.
It would be greatly appreciated if more complete and detailed instructions for dataset preparation could be provided for StormCast.
Thanks again for your support!
Hi @pzharrington, I'm working with @tang0214 on the same issue. Thanks a lot for the recent PR #880. We can see the team has made progress, and it really makes a difference.
In the new README,
Adding custom datasets
While it is possible to train StormCast on custom datasets by formatting them identically to the Zarr datasets used in the ERA5-HRRR example, a more flexible option is to define a custom dataset object. These datasets must follow the StormCastDataset interface defined in datasets/base.py; see the docstrings in that file for a specification of what the functions must accept and return. You can use the datasets/data_loader_hrrr_era5.py implementation as an example.
Is "datasets/base.py" the same as "datasets/__init__.py"?
Hi @running-berry, thanks for highlighting this issue. The README has a typo: the file that specifies the base dataset class is actually datasets/dataset.py. We will update this. @jleinonen for viz