modulus
🐛[BUG]: Compute dataset statistics on training data
Version
0.6.0
On which installation method(s) does this occur?
Docker
Describe the issue
In examples/weather/dataset_download/start_mirror.py, the global_means and global_stds files (used later for normalization) are computed over the entire dataset rather than over the training set only, so statistics from the held-out years leak into the normalization.
Current implementation
if cfg.compute_mean_std:
    stats_path = os.path.join(cfg.hdf5_store_path, "stats")
    print(f"Saving global mean and std at {stats_path}")
    if not os.path.exists(stats_path):
        os.makedirs(stats_path)
    era5_mean = np.array(
        era5_xarray.mean(dim=("time", "latitude", "longitude")).values
    )
    np.save(
        os.path.join(stats_path, "global_means.npy"), era5_mean.reshape(1, -1, 1, 1)
    )
    era5_std = np.array(
        era5_xarray.std(dim=("time", "latitude", "longitude")).values
    )
    np.save(
        os.path.join(stats_path, "global_stds.npy"), era5_std.reshape(1, -1, 1, 1)
    )
    print(f"Finished saving global mean and std at {stats_path}")
Proposed modification
if cfg.compute_mean_std:
    # Compute stats only on training data
    train_era5_xarray = era5_xarray.sel(
        time=era5_xarray.time.dt.year.isin(train_years)
    )
    stats_path = os.path.join(cfg.hdf5_store_path, "stats")
    print(f"Saving global mean and std at {stats_path}")
    if not os.path.exists(stats_path):
        os.makedirs(stats_path)
    era5_mean = np.array(
        train_era5_xarray.mean(dim=("time", "latitude", "longitude")).values
    )
    np.save(
        os.path.join(stats_path, "global_means.npy"), era5_mean.reshape(1, -1, 1, 1)
    )
    era5_std = np.array(
        train_era5_xarray.std(dim=("time", "latitude", "longitude")).values
    )
    np.save(
        os.path.join(stats_path, "global_stds.npy"), era5_std.reshape(1, -1, 1, 1)
    )
    print(f"Finished saving global mean and std at {stats_path}")
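To check the effect of the proposed change without downloading ERA5, the selection step can be tried on synthetic data. The following is a minimal sketch, not the code from start_mirror.py: it assumes era5_xarray is an xarray.DataArray with dimensions (time, channel, latitude, longitude), and the year range, array shapes, values, and train_years split are all made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for era5_xarray (assumed layout: time, channel, lat, lon).
# Years, shapes, and values are hypothetical and chosen only for illustration.
time = pd.date_range("2000-01-01", "2003-12-31", freq="D")
rng = np.random.default_rng(0)
data = rng.normal(size=(time.size, 2, 4, 8))
# Shift the last year so that held-out data visibly changes the statistics.
data[time.year == 2003] += 5.0

era5_xarray = xr.DataArray(
    data,
    dims=("time", "channel", "latitude", "longitude"),
    coords={"time": time},
)

train_years = [2000, 2001, 2002]  # hypothetical training split

# Keep only the time steps whose year belongs to the training split.
train_era5_xarray = era5_xarray.sel(
    time=era5_xarray.time.dt.year.isin(train_years)
)

# Reduce over everything except the channel dimension, as in the issue.
reduce_dims = ("time", "latitude", "longitude")
train_mean = train_era5_xarray.mean(dim=reduce_dims).values.reshape(1, -1, 1, 1)
train_std = train_era5_xarray.std(dim=reduce_dims).values.reshape(1, -1, 1, 1)
full_mean = era5_xarray.mean(dim=reduce_dims).values.reshape(1, -1, 1, 1)

print("train-only mean per channel:", train_mean.ravel())
print("full-dataset mean per channel:", full_mean.ravel())
```

In this toy setup the full-dataset mean is pulled toward the shifted 2003 values, while the train-only mean is not, which is the kind of leakage the proposed modification is meant to avoid.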
Minimum reproducible example
No response
Relevant log output
No response
Environment details
Modulus Docker container version 24.04