
Loading BOM data into RAM


Hello,

I am trying to train this model on Australia's BOM radar data, but I am having trouble loading the data into memory.

I have one year's worth of data in netCDF4 format at 5-minute time steps, with each time step stored as a separate NC file. For example, the precipitation field for 01/01/2022 at 12:30 pm is at: BOM Rain Rate Data 2022 (folder) > 20220101 (folder) > 20220101_123000.nc. Within each file, the field is stored as an array of int64 values in mm/h under a variable called 'rain_rate'. I have tried the netCDF4 and xarray libraries for Python and receive an OOM error.
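
For illustration, here is a minimal sketch of the kind of eager loading that triggers the OOM (paths follow the folder structure above):

```python
# Sketch of the eager approach: every frame is decompressed into memory at once.
import glob

import numpy as np
import xarray as xr

files = sorted(glob.glob("BOM Rain Rate Data 2022/*/*.nc"))

# Each file decompresses to a full int64 (H, W) frame, so stacking a
# whole year needs ~180 GB and fails with an out-of-memory error.
frames = [xr.open_dataset(path)["rain_rate"].values for path in files]
year = np.stack(frames)
```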

The problem is that loading all of the available 2022 data (~300 days) would require approximately 180 GB of RAM, which I do not have. The netCDF format must be compressing the data, since the 2022 data only takes ~5 GB on disk.

How would I go about efficiently loading all of this data and passing it to DGMR?

Thanks for your help.

cameronko commented Dec 23 '22 12:12

Hey, sorry for the delay, I just missed this issue. I wouldn't load it all into RAM at once. For training, we lazily load the data we need from either the UK Nimrod dataset or US MRMS, so we only have the small examples in memory at any given time. We tend to use Zarr and xarray, which work fairly well for that, but yeah, don't load it all into memory at once.
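
For anyone hitting the same problem, here is a rough sketch of that lazy, windowed approach over the folder layout from the original post. This is illustrative rather than our actual pipeline: the `BOMRadarDataset` class is hypothetical, and the 4-input/18-forecast split and tensor shapes may need adjusting to match DGMR's expected input.

```python
# Sketch only: serve small training windows lazily instead of loading the year.
import glob

import numpy as np
import torch
import xarray as xr
from torch.utils.data import Dataset

# Optional one-time conversion to a chunked Zarr store so later reads only
# touch the chunks they need (needs dask; assumes a 'time' dimension exists):
# ds = xr.open_mfdataset("BOM Rain Rate Data 2022/*/*.nc", combine="by_coords")
# ds.chunk({"time": 22}).to_zarr("bom_2022.zarr")

class BOMRadarDataset(Dataset):
    """Yields (context, target) frame windows, decompressing only the
    handful of files each sample needs rather than the whole archive."""

    def __init__(self, root="BOM Rain Rate Data 2022", num_input=4, num_forecast=18):
        self.files = sorted(glob.glob(f"{root}/*/*.nc"))
        self.num_input = num_input
        self.window = num_input + num_forecast

    def __len__(self):
        return len(self.files) - self.window + 1

    def __getitem__(self, idx):
        # Only self.window files are opened and decompressed per sample.
        frames = [
            xr.open_dataset(path)["rain_rate"].values.astype(np.float32)
            for path in self.files[idx : idx + self.window]
        ]
        video = torch.from_numpy(np.stack(frames)).unsqueeze(1)  # (T, 1, H, W)
        return video[: self.num_input], video[self.num_input :]
```

Wrapping that in a DataLoader with a few workers keeps training fed while each worker only ever holds its current window in memory.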

jacobbieker commented Feb 01 '23 12:02

@all-contributors please add @primeoc for question

peterdudfield commented Mar 24 '23 09:03

@peterdudfield

I've put up a pull request to add @primeoc! :tada:

allcontributors[bot] commented Mar 24 '23 09:03