Specify custom filename with 'H.download()'

Open blaylockbk opened this issue 2 years ago • 6 comments

See #58.

You can do this by setting H.LOCALFILE before calling H.download(), but this only works when downloading full files. I still need to come up with a way to override the filename when downloading subsets.

blaylockbk avatar Mar 20 '22 03:03 blaylockbk

also, I should consider changing the filename for subset files

Change how I name subset files...

Instead of: hrrr.t21z.wrfprsf00.grib2.subset_71498633cb14ac0f58a9efbcb28ba0029d0b823b
Do this:    subset_71498633cb14ac0f58a9efbcb28ba0029d0b823b_hrrr.t21z.wrfprsf00.grib2

<subset Hash>_<regular name>

This keeps the .grib2 extension at the end of the filename (instead of the subset hash), which will make it easier to change file names.
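
To make the difference between the two naming schemes concrete, here is a minimal sketch. The helper names old_subset_name and new_subset_name are hypothetical (they are not Herbie functions); the point is only that pathlib, like most tools, treats whatever follows the last dot as the file extension.

```python
from pathlib import Path


def old_subset_name(name: str, subset_hash: str) -> str:
    """Old scheme (hypothetical helper): hash appended after the extension."""
    return f"{name}.subset_{subset_hash}"


def new_subset_name(name: str, subset_hash: str) -> str:
    """New scheme (hypothetical helper): hash prefixed, extension stays last."""
    return f"subset_{subset_hash}_{name}"


name = "hrrr.t21z.wrfprsf00.grib2"
subset_hash = "71498633cb14ac0f58a9efbcb28ba0029d0b823b"

old = old_subset_name(name, subset_hash)
new = new_subset_name(name, subset_hash)

# With the old scheme the "extension" is the hash, not .grib2.
print(Path(old).suffix)
# With the new scheme the extension is .grib2 again.
print(Path(new).suffix)
```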

This task is done!

blaylockbk avatar Mar 21 '22 16:03 blaylockbk

Hey Brian,

I just wanted to add a comment here. Let me know if I should post this as a separate issue.

I noticed when downloading subsets of ECMWF operational forecast GRIB files that the subset hash portion of the filename changes even though the variable being requested does not. Because the hash comes first in the filename, the files are listed with their forecast hours out of order, so I have to rename them (e.g. by removing the hash) to get them listed in order.

An example of what I mean is shown below. The files are for a single variable (MSLP) over a 48-hour forecast period. I double-checked with pygrib that every file does subset the same variable (MSLP).

subset_0716d9708d321ffb6a00818614779e779925365c__20220512000000-24h-oper-fc.grib2
subset_0716d9708d321ffb6a00818614779e779925365c__20220512000000-39h-oper-fc.grib2
subset_12c6fc06c99a462375eeb3f43dfd832b08ca9e17__20220512000000-0h-oper-fc.grib2
subset_12c6fc06c99a462375eeb3f43dfd832b08ca9e17__20220512000000-15h-oper-fc.grib2
subset_22d200f8670dbdb3e253a90eee5098477c95c23d__20220512000000-18h-oper-fc.grib2
subset_22d200f8670dbdb3e253a90eee5098477c95c23d__20220512000000-30h-oper-fc.grib2
subset_22d200f8670dbdb3e253a90eee5098477c95c23d__20220512000000-9h-oper-fc.grib2
subset_632667547e7cd3e0466547863e1207a8c0c0c549__20220512000000-6h-oper-fc.grib2
subset_761f22b2c1593d0bb87e0b606f990ba4974706de__20220512000000-12h-oper-fc.grib2
subset_7719a1c782a1ba91c031a682a0a2f8658209adbf__20220512000000-42h-oper-fc.grib2
subset_887309d048beef83ad3eabf2a79a64a389ab1c9f__20220512000000-33h-oper-fc.grib2
subset_bc33ea4e26e5e1af1408321416956113a4658763__20220512000000-45h-oper-fc.grib2
subset_bd307a3ec329e10a2cff8fb87480823da114f8f4__20220512000000-21h-oper-fc.grib2
subset_bd307a3ec329e10a2cff8fb87480823da114f8f4__20220512000000-3h-oper-fc.grib2
subset_f6e1126cedebf23e1463aee73f9df08783640400__20220512000000-36h-oper-fc.grib2
subset_fa35e192121eabf3dabf9f5ea6abdbcbc107ac3b__20220512000000-27h-oper-fc.grib2

This does not seem to be occurring for other models I have tried (i.e. GFS, NAM, HRRR), so I assume it has something to do with how ECMWF releases and/or packages their data maybe?

The only quick fix I could suggest would be to move the hash portion to the end of the filename, which would look something like:

20220512000000-0h-oper-fc_subset_12c6fc06c99a462375eeb3f43dfd832b08ca9e17.grib2
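
As a stopgap on the user side, the same renaming can be sketched in a few lines. This is only an illustration: hash_to_end is a hypothetical helper, not part of Herbie, and it assumes the "subset_<hash>__<name>" pattern seen in the listing above.

```python
import re

# Assumed pattern: "subset_<hex hash>__<original filename>"
SUBSET_PREFIX = re.compile(r"^subset_(?P<hash>[0-9a-f]+)__(?P<name>.+)$")


def hash_to_end(filename: str) -> str:
    """Move the subset hash from the front of the name to just before the extension."""
    match = SUBSET_PREFIX.match(filename)
    if match is None:
        # Not a subset file in the assumed pattern; leave it untouched.
        return filename
    stem, dot, ext = match["name"].rpartition(".")
    if not dot:
        # No extension at all: simply append the hash.
        return f"{match['name']}_subset_{match['hash']}"
    return f"{stem}_subset_{match['hash']}.{ext}"


print(hash_to_end(
    "subset_12c6fc06c99a462375eeb3f43dfd832b08ca9e17__20220512000000-0h-oper-fc.grib2"
))
```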

Maybe there are issues with this suggestion though. Anyways, thanks for your time and attention. I hope this helps to improve the Herbie package.

mariandob avatar Aug 18 '22 18:08 mariandob

Thanks for bringing this to my attention. The hash is based on the numbers of the GRIB fields in the subset. I suspect that the variable you are grabbing is in a different "row" for different forecast hours.

I agree, the subset hash should be appended to the end to help the sort order. In fact, this is how it used to be about 8 months ago. Can't remember why I changed that...probably because I assumed the hash would be the same.

blaylockbk avatar Aug 18 '22 23:08 blaylockbk

I just started looking at this. The reason the subset hashes are not the same (and do not provide a good sort order) is that the GRIB message number of MSL is different in each GRIB2 file.

  • f00 : MSL is the 32nd grib message
  • f03 : MSL is the 24th grib message
  • f06 : MSL is the 38th grib message
  • etc.

The default ECMWF naming convention doesn't give a good sort order either. If I sort that list you gave, the forecasts are still out of order:

['20220512000000-0h-oper-fc.grib2',
 '20220512000000-12h-oper-fc.grib2',
 '20220512000000-15h-oper-fc.grib2',
 '20220512000000-18h-oper-fc.grib2',
 '20220512000000-21h-oper-fc.grib2',
 '20220512000000-24h-oper-fc.grib2',
 '20220512000000-27h-oper-fc.grib2',
 '20220512000000-30h-oper-fc.grib2',
 '20220512000000-33h-oper-fc.grib2',
 '20220512000000-36h-oper-fc.grib2',
 '20220512000000-39h-oper-fc.grib2',
 '20220512000000-3h-oper-fc.grib2',
 '20220512000000-42h-oper-fc.grib2',
 '20220512000000-45h-oper-fc.grib2',
 '20220512000000-6h-oper-fc.grib2',
 '20220512000000-9h-oper-fc.grib2']

So, inserting the subset hash after the real file name won't help either.
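
If the goal is just to loop over the files in forecast order, one option that does not depend on the filename layout is to sort with a key that parses the lead time as an integer. This is a sketch, assuming the "-<N>h-" field seen in the ECMWF names above; forecast_hour is a hypothetical helper.

```python
import re

# Assumed: the lead time appears as "-<digits>h-" in the filename.
FORECAST_HOUR = re.compile(r"-(\d+)h-")


def forecast_hour(filename: str) -> int:
    """Return the forecast lead time (in hours) parsed from the filename."""
    match = FORECAST_HOUR.search(filename)
    if match is None:
        raise ValueError(f"no forecast hour found in {filename!r}")
    return int(match.group(1))


files = [
    "subset_0716d9708d321ffb6a00818614779e779925365c__20220512000000-24h-oper-fc.grib2",
    "subset_12c6fc06c99a462375eeb3f43dfd832b08ca9e17__20220512000000-0h-oper-fc.grib2",
    "subset_22d200f8670dbdb3e253a90eee5098477c95c23d__20220512000000-9h-oper-fc.grib2",
    "subset_bd307a3ec329e10a2cff8fb87480823da114f8f4__20220512000000-3h-oper-fc.grib2",
]

# Sort numerically by lead time instead of alphabetically by name.
for name in sorted(files, key=forecast_hour):
    print(forecast_hour(name), name)
```

This works whether the hash is at the front, at the end, or missing, because the key ignores everything except the lead time. For files from several initialization times, the key would also need to include the date portion.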

blaylockbk avatar Aug 24 '22 03:08 blaylockbk

I made a tweak to how I generate the hash labels in https://github.com/blaylockbk/Herbie/pull/96. However, it doesn't address your request for filenames that sort in forecast order when listed.

Herbie now uses a shortened hash which is composed of three parts:

  1. The model initialization date
  2. The model forecast lead time
  3. A hash that represents all the GRIB messages in the subset.

For example (notice the hash portion is much shorter 😄):

subset_511533a6__20220801000000-3h-oper-fc.grib2
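
For readers curious how a short label like this could be built from the three parts, here is a rough sketch. It is not Herbie's actual implementation: the function name subset_label, the key layout, the choice of SHA-1, and the 8-character truncation are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime


def subset_label(init: datetime, fxx: int, messages: list[int], length: int = 8) -> str:
    """Build a short label from init date, lead time, and GRIB message numbers.

    Hypothetical helper; the key format and hash function are assumptions.
    """
    key = f"{init:%Y%m%d%H%M}|f{fxx:03d}|{sorted(messages)}"
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Keep only the first few hex characters so filenames stay short.
    return digest[:length]


init = datetime(2022, 8, 1)

# Same variable, but stored as a different message number in each file,
# so the labels differ from one forecast hour to the next.
print(subset_label(init, 0, [32]))
print(subset_label(init, 3, [24]))
```

Any label built from the message numbers will vary between files when the variable's position varies, which is why shortening the hash does not by itself change the sort order.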

This still doesn't help your sorting problem (but at least the files with the same initialization time would be grouped together). Is there any particular reason the sort order is important to you, aside from visual inspection of the files?

At any rate, your use case does give me another reason Herbie needs the ability to define a custom filename structure.

blaylockbk avatar Aug 24 '22 03:08 blaylockbk

Thank you for all of your great work on this Brian. The new shortened hash looks nice.

I think the reason why the files are still listed out of order is that the forecast hour portion of the filename is not zero-padded (e.g. ...-3h-oper... is NOT ...-03h-oper...). I don't think there is anything that can be done about that, which is totally fine. My primary reason for caring about this in the first place was that I was looping through a series of ECMWF forecast files to create plots, and it was easier for me to check for mistakes when the files were ordered. In the end, I do not really think there was any reason to fix the file naming convention other than ease of visual inspection.
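
For anyone who does want padded names locally, the renaming can be done after download. This is a sketch under the assumption that the lead time appears as "-<N>h-" in the filename; zero_pad_hour is a hypothetical helper, and a width of 3 is chosen here only so that lead times of 100 hours or more would also sort correctly.

```python
import re

# Assumed: the lead time appears as "-<digits>h-" in the filename.
HOUR_FIELD = re.compile(r"-(\d+)h-")


def zero_pad_hour(filename: str, width: int = 3) -> str:
    """Return the filename with its forecast hour zero-padded to `width` digits."""
    return HOUR_FIELD.sub(
        lambda m: f"-{int(m.group(1)):0{width}d}h-",
        filename,
        count=1,  # only touch the first matching field
    )


names = [
    "20220512000000-0h-oper-fc.grib2",
    "20220512000000-12h-oper-fc.grib2",
    "20220512000000-3h-oper-fc.grib2",
]

# After padding, a plain alphabetical sort is also a forecast-hour sort.
for padded in sorted(zero_pad_hour(n) for n in names):
    print(padded)
```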

Thanks again for the modifications.

mariandob avatar Aug 24 '22 18:08 mariandob