pyaerocom

EMEP reader: Years in path

Open WillemVanCaspel opened this issue 2 years ago • 6 comments

Is there something that can be done about the requirement for a year to be in the path? It is also unclear to me in which part of the path the year should be.

For example, for an evaluation against 2021 data, the following path to model data was not accepted:

Path: /lustre/storeB/users/willemvc/modelruns/Gauss_test_2021/u7_ref
Error: Failed to load model data: u7ref (concpm10). Reason Could not find any year in u7_ref

But these paths were okay:

Path: /lustre/storeB/users/willemvc/modelruns/Gauss_test_2021/rep2023
Path: /lustre/storeB/users/willemvc/modelruns/Gauss_test_2021/vra2021
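The behaviour above is consistent with the year being looked up only in the last path component: `Gauss_test_2021` in the parent directory did not help `u7_ref`. A minimal sketch of such a check, assuming a plain four-digit-year search; the pattern and the function name `year_from_last_component` are hypothetical and not taken from pyaerocom:

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumption: a "year" is a standalone four-digit number starting with 19 or 20.
YEAR_PATTERN = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")


def year_from_last_component(path: str) -> Optional[int]:
    """Return the first year found in the last path component, or None."""
    name = PurePosixPath(path).name
    match = YEAR_PATTERN.search(name)
    return int(match.group(1)) if match else None


base = "/lustre/storeB/users/willemvc/modelruns/Gauss_test_2021"
for run in ("u7_ref", "rep2023", "vra2021"):
    # Only the run directory name is inspected, not the parent directories.
    print(run, year_from_last_component(f"{base}/{run}"))
```

Under this assumption `u7_ref` yields no year, while `rep2023` and `vra2021` yield 2023 and 2021.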

WillemVanCaspel avatar Aug 17 '23 08:08 WillemVanCaspel

This concerns the EMEP reader, right? Aerocom has a year in its file name. Pyaerocom needs to know the year of the data and since we don't use the model data's time variable (because it's almost never right), the year information has to come from somewhere.

Personally I always wondered why the year is not in the file name...

jgriesfeller avatar Aug 22 '23 07:08 jgriesfeller

The EMEP year information is pretty unnecessary anyway, because the year specified in the config for the observational comparison is what's being used for the co-location, right?

As in, an EMEP model path with 2023 in it (like in my above post) will work equally well as an EMEP model path with 2021 in it, when both are being compared against 2021 observations. So the year 2023 in the EMEP path is meaningless, but it still causes Pyaerocom to not evaluate the model when the (or a) year is not there.

So then why is that requirement there in the first place?

WillemVanCaspel avatar Aug 22 '23 07:08 WillemVanCaspel

Keep in mind that in theory pyaerocom should not need to know anything about a config file to run an Aeroval experiment. Like Jan said, the year has to come from somewhere. Whether you put it in the file path or the filename could be an implementation detail worth discussing.

lewisblake avatar Aug 23 '23 15:08 lewisblake

Would it then be an idea that, if the evaluation year is specified in the config through e.g. periods=["2021"], then the requirement on a year to be present in the model path or filename can be lifted?

Maybe this all sounds a bit silly. It's just that the year-in-path requirement yet again managed to lead to quite some headache during this year's reporting.

WillemVanCaspel avatar Aug 28 '23 08:08 WillemVanCaspel

From a discussion with Jan at our 2023-09-11 development meeting:

I understand this has been a source of some confusion in the past (for myself included). There are however a few points to keep in mind:

  • How can one run a multi-year analysis if the file path contains no information about the year?
  • A safety mechanism in pyaerocom is that we check files based on the year provided, and if a candidate file doesn't have the correct number of timestamps, we do not read the file. The time variable is too error-prone to read in and use for the analysis year. Users often don't manage to put the correct information in this field.
  • If changing the model output is problematic, does it not suffice to create a symbolic link between your data and an Aeroval-analysis compliant file path?
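To illustrate the timestamp check in the second point: given a year and a time resolution, the expected number of timestamps is fixed, so a file with a different count can be rejected. This is only a sketch of the idea. The frequency names and both function names are assumptions, and pyaerocom's actual check is not shown in this thread.

```python
import calendar


def expected_timestamps(year: int, freq: str) -> int:
    """Number of timestamps a complete file for `year` should contain."""
    days = 366 if calendar.isleap(year) else 365
    per_year = {
        "hourly": days * 24,
        "daily": days,
        "monthly": 12,
        "yearly": 1,
    }
    if freq not in per_year:
        raise ValueError(f"unknown frequency: {freq!r}")
    return per_year[freq]


def looks_complete(n_timestamps: int, year: int, freq: str) -> bool:
    """True if the file has exactly the expected number of timestamps."""
    return n_timestamps == expected_timestamps(year, freq)
```

A daily file with 365 timestamps would pass for 2021 but fail for the leap year 2020, which is why the year has to be known before the file is accepted.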

One option could be to create a script to send to the queue which takes EMEP output and converts the file paths to be compliant. We often have to take model output and convert it into Aerocom format. It can be a frustrating and time-consuming extra step but the standards are there to ensure accurate processing.
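A minimal sketch of the symbolic-link workaround, assuming that appending the year to the run directory name is enough for the reader to find it. The helper `link_with_year` and the `<name>_<year>` naming are hypothetical, and this has not been checked against the EMEP reader itself.

```python
from pathlib import Path
from typing import Optional, Union


def link_with_year(
    run_dir: Union[str, Path],
    year: int,
    link_parent: Optional[Union[str, Path]] = None,
) -> Path:
    """Create a symlink named '<run_dir name>_<year>' pointing at run_dir.

    The link is placed next to run_dir unless link_parent is given.
    An existing entry with the same name is left untouched.
    """
    run_dir = Path(run_dir).resolve()
    parent = Path(link_parent) if link_parent is not None else run_dir.parent
    link = parent / f"{run_dir.name}_{year}"
    if not (link.exists() or link.is_symlink()):
        link.symlink_to(run_dir, target_is_directory=True)
    return link


# Example (not executed here):
# link_with_year(".../Gauss_test_2021/u7_ref", 2021)
# would create ".../Gauss_test_2021/u7_ref_2021" pointing at "u7_ref".
```

The model output itself stays unchanged; only the path handed to the evaluation differs.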

lewisblake avatar Sep 11 '23 09:09 lewisblake

I just ran my 1st aeroval analysis using EMEP data where I configured the EMEP reading myself. In general, since pyaerocom does not trust the time variable, we need the year information from somewhere. As the EMEP reader is right now, it gets that info from the file path. This is stupid since it prevents using several years, because all years have to come from one subdirectory. Even multi-year analyses should work if the year were added to the filename rather than the file path. I agree that sometimes it would be helpful to use the time variable, but we would then need to open every file to read the time variable. It's much faster to get that info from the file name. Another possibility is to reformat EMEP data to aerocom format again...

jgriesfeller avatar Sep 11 '23 13:09 jgriesfeller

It is not clear what the requirements here are and from my work with emep-data I would express them as:

  • The time variable from the emep/mscw model is very reliable and should be used where possible. Retrieving the year from the nc-files is not too time-consuming.
  • For multi-year trend analysis, rather than giving all the file paths, we want to give a directory containing subdirectories named by year, e.g. 2010, 2015, 2020, or 2010_ref1, 2015_ref1... Subdirectories without a year, or subdirectories without Base_*.nc files, should be omitted.

Reading years from the data directory path is generally a bad idea, since paths usually contain the reporting, emission and meteorology years all at once, and there is no convention for where to write which. If modellers want to use the Base_*.nc data for comparison with a different observation year, they have to fix the Base_*.nc file.
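The directory layout proposed in the second point could be scanned roughly as follows. This is a sketch of the proposal, not existing pyaerocom behaviour: the function name `find_year_dirs` is hypothetical, and it assumes the year stands at the start of the subdirectory name, as in `2010` or `2010_ref1`.

```python
import re
from pathlib import Path
from typing import Dict, Union

# Assumption: the subdirectory name begins with a four-digit year.
YEAR_AT_START = re.compile(r"^((?:19|20)\d{2})(?!\d)")


def find_year_dirs(root: Union[str, Path]) -> Dict[int, Path]:
    """Map year -> subdirectory for subdirectories that hold Base_*.nc files.

    Subdirectories without a leading year or without Base_*.nc files are
    skipped. If two subdirectories share a year, the first in sorted order
    is kept.
    """
    result: Dict[int, Path] = {}
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue
        match = YEAR_AT_START.match(sub.name)
        if match is None:
            continue  # no year in the name
        if not any(sub.glob("Base_*.nc")):
            continue  # no model output in this directory
        result.setdefault(int(match.group(1)), sub)
    return result
```

The year taken from the directory name here only selects which runs to load; under the first point, the year used for the analysis would still come from the time variable inside each file.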

heikoklein avatar Jul 06 '24 13:07 heikoklein