graphcast
Constructing datasets from CDS products
My understanding is that each input data file is constructed from a couple of CDS data products for 'atmospheric' and 'single-level' variables. For example, the two queries below pull the required data (for a specific date), as I understand it:
import cdsapi
c = cdsapi.Client()
c.retrieve(
    'reanalysis-era5-pressure-levels',
    {
        'product_type': 'reanalysis',
        'format': 'netcdf',
        'variable': [
            'geopotential', 'specific_humidity', 'temperature',
            'u_component_of_wind', 'v_component_of_wind', 'vertical_velocity',
        ],
        'year': '2023',
        'month': '11',
        'day': '01',
        'time': [
            '00:00', '06:00',
        ],
        'pressure_level': [
            '1', '2', '3', '5', '7', '10', '20', '30', '50', '70',
            '100', '125', '150', '175', '200', '225', '250', '300',
            '350', '400', '450', '500', '550', '600', '650', '700',
            '750', '775', '800', '825', '850', '875', '900', '925',
            '950', '975', '1000',
        ],
    },
    'atmospheric.nc')
c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'variable': [
            '10m_u_component_of_wind', '10m_v_component_of_wind', '2m_temperature',
            'geopotential', 'land_sea_mask', 'mean_sea_level_pressure',
            'toa_incident_solar_radiation', 'total_precipitation',
        ],
        'year': '2023',
        'month': '11',
        'day': '01',
        'time': [
            '00:00', '06:00',
        ],
        'format': 'netcdf',
    },
    'single-level.nc')
Presumably, these files are then spliced together to form an input file. While I'm able to combine the files easily enough, there are subtleties that I'm clearly messing up, like constructing the 'datetime' and 'batch' coordinates, which seem to be mandatory (e.g., ValueError: 'datetime' must be in data coordinates.).
Do you plan to publish a script or instructions for constructing input files from these data products in the format expected by the model? It would be a fantastic help for applying the model!
Many thanks in advance, Dan
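For anyone attempting this in the meantime, here is a minimal, unofficial sketch of the coordinate surgery in xarray. The arrays below are synthetic stand-ins for the two downloaded files, and the batch/datetime layout is inferred from the published example data, not from an official spec:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-ins for 'atmospheric.nc' and 'single-level.nc'
# (2 time steps, 3 pressure levels, tiny 4x8 grid).
times = pd.to_datetime(["2023-11-01T00:00", "2023-11-01T06:00"])
atmos = xr.Dataset(
    {"temperature": (("time", "level", "lat", "lon"), np.zeros((2, 3, 4, 8)))},
    coords={"time": times, "level": [500, 850, 1000],
            "lat": np.linspace(-90, 90, 4), "lon": np.linspace(0, 315, 8)},
)
single = xr.Dataset(
    {"2m_temperature": (("time", "lat", "lon"), np.zeros((2, 4, 8)))},
    coords={"time": times, "lat": atmos.lat, "lon": atmos.lon},
)

# Splice the two products into one dataset.
merged = xr.merge([atmos, single])

# The example files carry a size-1 'batch' dimension, a 'datetime'
# coordinate indexed by (batch, time), and 'time' itself holds
# timedeltas relative to the first step rather than absolute times.
datetimes = merged["time"].values
merged = merged.expand_dims(batch=1)
merged = merged.assign_coords(
    datetime=(("batch", "time"), datetimes[np.newaxis, :]))
merged["time"] = datetimes - datetimes[0]
```

This should at least get past the "'datetime' must be in data coordinates" error; the remaining per-variable details are discussed further down the thread.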
Following up - this would definitely be much appreciated! You have done an amazing job; please help us understand how we can feed in some real-world data if you find some spare time 🤍
Alberto
Would love as well to understand better how we can get or generate these input files. Awesome job, thanks,
Henri
In the same boat. Can't manage to get any data working other than the example data. This would really jumpstart user research!
tv
From what I have noticed: after I retrieved the data from CDS, merged the single-level and multi-level files using a pandas DataFrame, and converted the result to the same format as the GraphCast example data (an xarray Dataset), every data variable had all five indices as coordinates: batch, time, lat, lon, level.
However, that is not the case in the example data: 'land_sea_mask' has only two coordinates, 'lat' and 'lon', and similarly the single-level fields do not have 'level' as a coordinate.
The extra coordinates can be stripped by selecting a single index along each, e.g.,
data['land_sea_mask'].isel(batch = 0, time = 0, level = 0)
Other than that, 'time' has to be in pd.Timedelta format (relative offsets rather than absolute timestamps).
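A small self-contained sketch of both points, using a synthetic stand-in for the merged dataset (the target layout here is inferred from the example data; note that after isel the selected indices linger as scalar coordinates, so drop_vars finishes the job):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic merged dataset where every variable carries all five coords.
ds = xr.Dataset(
    {"land_sea_mask": (("batch", "time", "level", "lat", "lon"),
                       np.zeros((1, 2, 3, 4, 8)))},
    coords={"batch": [0], "time": [0, 6], "level": [500, 850, 1000],
            "lat": np.linspace(-90, 90, 4), "lon": np.linspace(0, 315, 8)},
)

# Reduce the static field to (lat, lon), then drop the leftover
# scalar coordinates that isel leaves behind.
static = (ds["land_sea_mask"]
          .isel(batch=0, time=0, level=0)
          .drop_vars(["batch", "time", "level"]))
assert static.dims == ("lat", "lon")

# 'time' as timedeltas, matching the example data.
time_td = pd.to_timedelta(ds["time"].values, unit="h")
```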
I recently learned about the ai-models package from ECMWF. It pulls the data directly from their MARS archive, so no formatting is required, and it works for GraphCast, Pangu-Weather, and FourCastNet! I haven't had time to try it myself yet, but it seems like a golden ticket. @sailgrib @Dan-C-Reed @ab3llini
https://github.com/ecmwf-lab/ai-models
This works for me to download and format one day's worth of data; you can modify it for multiple days:
import cdsapi
import os
import datetime
from netCDF4 import Dataset
import numpy as np
from glob import glob
def download_era5_data(date_str, levels, grid, save_dir):
    c = cdsapi.Client()
    # download the data at all levels
    c.retrieve('reanalysis-era5-complete', {
        'date'    : date_str,
        'levelist': '/'.join([str(x) for x in levels]),
        'levtype' : 'pl',
        'param'   : '/'.join([str(x) for x in era5_pram_codes_levels]),
        'stream'  : 'oper',
        'time'    : '00/to/23/by/6',
        'type'    : 'an',
        'grid'    : grid,
        'format'  : 'netcdf',
    }, f'{save_dir}/levels.nc')
    # download the data at the surface
    c.retrieve('reanalysis-era5-single-levels', {
        'date'        : date_str,
        'product_type': 'reanalysis',
        'param'       : '/'.join([str(x) for x in era5_pram_codes_surface]),
        'time'        : '00/to/23/by/6',
        'grid'        : grid,
        'format'      : 'netcdf',
    }, f'{save_dir}/surface.nc')
    # download the precipitation data for the given day
    c.retrieve('reanalysis-era5-single-levels', {
        'date'        : date_str,
        'product_type': 'reanalysis',
        'param'       : era5_precipitation_code,
        'time'        : '00/to/18/by/1',
        'grid'        : grid,
        'format'      : 'netcdf',
    }, f'{save_dir}/precipitation.nc')
    # download the last hours of the previous day
    date = datetime.datetime.strptime(date_str, '%Y-%m-%d')
    prev_date = date - datetime.timedelta(days=1)
    prev_date_str = prev_date.strftime('%Y-%m-%d')
    c.retrieve('reanalysis-era5-single-levels', {
        'date'        : prev_date_str,
        'product_type': 'reanalysis',
        'param'       : era5_precipitation_code,
        'time'        : '19/to/23/by/1',
        'grid'        : grid,
        'format'      : 'netcdf',
    }, f'{save_dir}/prev_precipitation.nc')

    # calculate the total precipitation over each preceding 6-hour window
    ds_l = Dataset(f'{save_dir}/levels.nc')
    ds_s = Dataset(f'{save_dir}/surface.nc')
    dsp1 = Dataset(f'{save_dir}/prev_precipitation.nc')
    dsp2 = Dataset(f'{save_dir}/precipitation.nc')
    precip = np.concatenate([dsp1['tp'][:], dsp2['tp'][:]], axis=0)
    _, nlat, nlon = precip.shape
    precip_6hr = np.zeros((4, nlat, nlon))
    for i in range(4):
        precip_6hr[i, :, :] = np.sum(precip[i*6:(i+1)*6, :, :], axis=0)

    # create the new combined dataset
    ds = Dataset(f'{save_dir}/{date_str}.nc', 'w', format='NETCDF4')
    # time dimension
    t_dim = ds.createDimension('time', 4)
    t_var = ds.createVariable('time', 'f4', ('time',))
    t_var.units = 'hours since 1900-01-01 00:00:00'
    t_var[:] = ds_s['time'][:]
    # batch dimension
    b_dim = ds.createDimension('batch', 1)
    b_var = ds.createVariable('batch', 'i4', ('batch',))
    b_var.units = 'batch'
    b_var[:] = [0]
    # datetime dimension
    dt_dim = ds.createDimension('datetime', 4)
    dt_var = ds.createVariable('datetime', 'i4', ('batch', 'datetime',))
    dt_var.units = 'hours'
    dt_var[:] = [[int(x * 6) for x in range(4)]]
    # level dimension
    l_dim = ds.createDimension('level', len(levels))
    l_var = ds.createVariable('level', 'f4', ('level',))
    l_var.units = 'millibars'
    l_var[:] = ds_l['level'][:]
    # latitude and longitude dimensions
    lat_dim = ds.createDimension('lat', nlat)
    lat_var = ds.createVariable('lat', 'f4', ('lat',))
    lat_var.units = 'degrees_north'
    lat_var[:] = np.flip(ds_s['latitude'][:], axis=0)
    lon_dim = ds.createDimension('lon', nlon)
    lon_var = ds.createVariable('lon', 'f4', ('lon',))
    lon_var.units = 'degrees_east'
    lon_var[:] = ds_s['longitude'][:]
    # temperature
    t_var = ds.createVariable('temperature', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    t_var.units = 'K'
    t_var.missing_value = -32767
    t_var[:] = np.expand_dims(np.flip(ds_l['t'][:], axis=2), axis=0)
    # u component of wind
    u_var = ds.createVariable('u_component_of_wind', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    u_var.units = 'm/s'
    u_var.missing_value = -32767
    u_var[:] = np.expand_dims(np.flip(ds_l['u'][:], axis=2), axis=0)
    # v component of wind
    v_var = ds.createVariable('v_component_of_wind', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    v_var.units = 'm/s'
    v_var.missing_value = -32767
    v_var[:] = np.expand_dims(np.flip(ds_l['v'][:], axis=2), axis=0)
    # w vertical velocity
    w_var = ds.createVariable('vertical_velocity', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    w_var.units = 'Pa/s'
    w_var.missing_value = -32767
    w_var[:] = np.expand_dims(np.flip(ds_l['w'][:], axis=2), axis=0)
    # specific humidity
    q_var = ds.createVariable('specific_humidity', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    q_var.units = 'kg/kg'
    q_var.missing_value = -32767
    q_var[:] = np.expand_dims(np.flip(ds_l['q'][:], axis=2), axis=0)
    # geopotential
    z_var = ds.createVariable('geopotential', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    z_var.units = 'm^2/s^2'
    z_var.missing_value = -32767
    z_var[:] = np.expand_dims(np.flip(ds_l['z'][:], axis=2), axis=0)
    # 10m u wind
    u10_var = ds.createVariable('10m_u_component_of_wind', 'f4', ('batch', 'time', 'lat', 'lon',))
    u10_var.units = 'm/s'
    u10_var.missing_value = -32767
    u10_var[:] = np.expand_dims(np.flip(ds_s['u10'][:], axis=1), axis=0)
    # 10m v wind
    v10_var = ds.createVariable('10m_v_component_of_wind', 'f4', ('batch', 'time', 'lat', 'lon',))
    v10_var.units = 'm/s'
    v10_var.missing_value = -32767
    v10_var[:] = np.expand_dims(np.flip(ds_s['v10'][:], axis=1), axis=0)
    # 2m temperature
    t2m_var = ds.createVariable('2m_temperature', 'f4', ('batch', 'time', 'lat', 'lon',))
    t2m_var.units = 'K'
    t2m_var.missing_value = -32767
    t2m_var[:] = np.expand_dims(np.flip(ds_s['t2m'][:], axis=1), axis=0)
    # mean sea level pressure
    msl_var = ds.createVariable('mean_sea_level_pressure', 'f4', ('batch', 'time', 'lat', 'lon',))
    msl_var.units = 'Pa'
    msl_var.missing_value = -32767
    msl_var[:] = np.expand_dims(np.flip(ds_s['msl'][:], axis=1), axis=0)
    # toa solar radiation
    tisr_var = ds.createVariable('toa_incident_solar_radiation', 'f4', ('batch', 'time', 'lat', 'lon',))
    tisr_var.units = 'J/m^2'
    tisr_var.missing_value = -32767
    tisr_var[:] = np.expand_dims(np.flip(ds_s['tisr'][:], axis=1), axis=0)
    # precipitation
    precip_var = ds.createVariable('total_precipitation_6hr', 'f4', ('batch', 'time', 'lat', 'lon',))
    precip_var.units = 'm'
    precip_var.missing_value = -32767
    precip_var[:] = np.expand_dims(np.flip(precip_6hr, axis=1), axis=0)
    # geopotential at the surface
    gh_var = ds.createVariable('geopotential_at_surface', 'f4', ('lat', 'lon',))
    gh_var.units = 'm^2/s^2'
    gh_var.missing_value = -32767
    gh_var[:] = np.flip(ds_s['z'][0, :, :], axis=0)
    # land sea mask
    lsm_var = ds.createVariable('land_sea_mask', 'f4', ('lat', 'lon',))
    lsm_var.units = '1'
    lsm_var.missing_value = -32767
    lsm_var[:] = np.flip(ds_s['lsm'][0, :, :], axis=0)
    ds.close()
    return f'{save_dir}/{date_str}.nc'
where
# parameter codes information at https://apps.ecmwf.int/codes/grib/param-db/
era5_pram_codes_levels = [
130, # temperature
131, # u component of wind
132, # v component of wind
135, # w vertical velocity
133, # q specific humidity
129, # z geopotential
]
era5_pram_codes_surface = [
165, # 10u 10m component of wind
166, # 10v 10m component of wind
167, # 2t temperature
151, # msl mean sea level pressure
212, # tisr toa solar radiation
129, # z geopotential at the surface
172, # lsm land sea mask
]
era5_precipitation_code = 228
# prod model
prod_model_levels = [1, 2, 3, 5, 7, 10, 20, 30, 50, 70, 100, 125, 150, 175, 200, 225, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975, 1000]
prod_model_grid = '0.25/0.25'
prod_checkpoint = 'params/params_GraphCast - ERA5 1979-2017 - resolution 0.25 - pressure levels 37 - mesh 2to6 - precipitation input and output.npz'
# small models
small_model_levels = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]
small_model_grid = '1.0/1.0'
small_checkpoint = 'params/params_GraphCast_small - ERA5 1979-2015 - resolution 1.0 - pressure levels 13 - mesh 2to5 - precipitation input and output.npz'
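Once a file is written, one way to sanity-check it against the layout of the GraphCast example data. This is a hedged sketch: the expected dims/coords below are read off the example files, and the synthetic dataset at the bottom is only for illustration (a real check would be run on xr.open_dataset of the generated file):

```python
import numpy as np
import xarray as xr

def check_graphcast_layout(ds: xr.Dataset) -> list:
    """Return a list of layout problems (empty list means it looks OK)."""
    problems = []
    # The example data carries these five dimensions.
    for dim in ("batch", "time", "level", "lat", "lon"):
        if dim not in ds.dims:
            problems.append(f"missing dimension: {dim}")
    # data_utils raises if 'datetime' is not a coordinate.
    if "datetime" not in ds.coords:
        problems.append("missing 'datetime' coordinate")
    # Static fields should be (lat, lon) only.
    for name in ("land_sea_mask", "geopotential_at_surface"):
        if name in ds and set(ds[name].dims) != {"lat", "lon"}:
            problems.append(f"{name} should have dims (lat, lon)")
    return problems

# Synthetic dataset shaped like the example data, for illustration.
ds = xr.Dataset(
    {"land_sea_mask": (("lat", "lon"), np.zeros((4, 8))),
     "2m_temperature": (("batch", "time", "lat", "lon"), np.zeros((1, 2, 4, 8)))},
    coords={"batch": [0], "time": [0, 1], "level": [500, 850],
            "lat": np.zeros(4), "lon": np.zeros(8),
            "datetime": (("batch", "time"),
                         np.zeros((1, 2), dtype="datetime64[ns]"))},
)
print(check_graphcast_layout(ds))  # prints: []
```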