pyaerocom requests pyaro reader to read data twice

Open dulte opened this issue 1 year ago • 1 comments

When using pyaro for an evaluation, I suspect that pyaro reads the data twice. I need to run more evaluations to check whether this is the case.

dulte avatar Aug 09 '24 11:08 dulte

As mentioned by Lewis in #1302, the cause is caching. Without caching, everything works as intended.

With caching: To check whether a cached file exists, the Ungriddedreader class is instantiated, which in turn instantiates read_pyaro and thus the pyaro reader class. Since pyaro readers read their files in the constructor, the files are read immediately. Only after this does pyaerocom check for a cached file. In summary, since pyaro readers (mostly) read their files in the init method, it is difficult to use pyaro with caching without pyaro reading the data, even when cached files exist.

So the solution is either to have an explicit read() method in pyaro (no reading in init), or to continue without caching for pyaro...
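
To illustrate the difference, here is a minimal sketch of the two patterns. All names (`EagerReader`, `LazyReader`, `load_with_cache`) are hypothetical and are not part of pyaro or pyaerocom. The sketch only shows why a cache check can stay cheap when reading is deferred to an explicit read() call.

```python
class EagerReader:
    """Hypothetical reader that loads everything in __init__ (the problematic pattern)."""

    def __init__(self, loader):
        self._loader = loader
        self.read_count = 0
        # Data is read as soon as the object exists, even if a cache hit follows.
        self._data = self._load()

    def _load(self):
        self.read_count += 1
        return self._loader()

    def read(self):
        return self._data


class LazyReader:
    """Hypothetical reader that defers loading until read() is called."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.read_count = 0

    def read(self):
        # Load once, on first request only.
        if self._data is None:
            self.read_count += 1
            self._data = self._loader()
        return self._data


def load_with_cache(reader, cache, key):
    """Return cached data if present; otherwise read and store it."""
    if key in cache:
        return cache[key]
    cache[key] = reader.read()
    return cache[key]
```

With `LazyReader`, constructing the reader before the cache lookup costs nothing, so a cache hit never touches the underlying files. With `EagerReader`, the files have already been read by the time the cache is consulted.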

dulte avatar Sep 16 '24 18:09 dulte

I will adjust pyaerocom and pyaro to use a separate read() method. Especially with parallelization, online data providers (e.g. ACTRIS-EBAS) might not like it when we read the same data multiple times in parallel, so we need the caching to work. But there might be cases where it is not easy to determine whether the cache needs to be invalidated (e.g. when all data has to be read first). In that case (again the ACTRIS-EBAS reader) I will return today's date string as the revision string for now and allow some grace time in pyaerocom when determining a cache hit.
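
The grace-time idea could look roughly like the sketch below. The function name `is_cache_hit`, the ISO date format (YYYY-MM-DD) for the revision string, and the default grace period of one day are all assumptions made for illustration; the actual format and tolerance are not specified in this thread.

```python
from datetime import date, timedelta


def is_cache_hit(cached_revision: str, today: date, grace_days: int = 1) -> bool:
    """Treat a cache as valid if its date-based revision is recent enough.

    cached_revision is assumed to be an ISO date string (YYYY-MM-DD).
    """
    try:
        cached = date.fromisoformat(cached_revision)
    except ValueError:
        # Unparseable revision string: be conservative and invalidate.
        return False
    age = today - cached
    # Reject revisions from the future and those older than the grace period.
    return timedelta(0) <= age <= timedelta(days=grace_days)
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check deterministic and easy to test.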

jgriesfeller avatar Oct 18 '24 12:10 jgriesfeller

I'm not sure I understand the issue here.

pyaro has separated read() methods from the init-method.

If pyaerocom needs a wrapper which just contains the init-parameters to pyaro, it should be easy to fix that in pyaerocom.

heikoklein avatar Oct 21 '24 13:10 heikoklein

@magnusuMET Are you working on this issue now?

heikoklein avatar Nov 08 '24 14:11 heikoklein

I am redoing the pyaro->ungriddeddata so will make sure to test for this, or at least document it

magnusuMET avatar Nov 08 '24 14:11 magnusuMET

Just for documentation: I think the reason the pyaro data is read twice is the following: In readungriddedbase there is a method called var_supported that checks against a list of supported variables of a given reader https://github.com/metno/pyaerocom/blob/c0a8e9d06f48fe6b2150517993bb0d6900d6fd21/pyaerocom/io/readungriddedbase.py#L378-L385

Pyaerocom's pyaro interface maps this to self.reader.variables(), which provides a list of read variables: https://github.com/metno/pyaerocom/blob/c0a8e9d06f48fe6b2150517993bb0d6900d6fd21/pyaerocom/io/pyaro/read_pyaro.py#L50-L59 and therefore reads the data. Because pyaro has no internal caching, the data is then read a second time when pyaerocom actually works with the data, as that is an entirely separate call to pyaro.

A solution for the problem would be to implement a method PROVIDES_VARIABLES in pyaro that doesn't read the data.

jgriesfeller avatar Nov 11 '24 09:11 jgriesfeller

Thanks for the analysis. I think the solution is to make the pyaro-reader developers aware of the need for an inexpensive variables() call (no need to rename it to PROVIDES_VARIABLES), like the netcdf-rw reader does here: https://github.com/metno/pyaro-readers/blob/002af7966793635fcf3ae62509ae64132a2e9e02/src/pyaro_readers/netcdf_rw/Netcdf_RWTimeseries.py#L66
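
As an illustration of what an inexpensive variables() call can mean, here is a small sketch using a hypothetical CSV-based reader. `CsvTimeseriesReader` and its file layout (a `time` column followed by one column per variable) are assumptions for this example. It is not the pyaro interface and not how the netcdf-rw reader is implemented.

```python
import csv


class CsvTimeseriesReader:
    """Hypothetical reader (not part of pyaro) with a cheap variables() call."""

    def __init__(self, path):
        # Nothing is read here; only the path is stored.
        self._path = path
        self._rows = None
        self.full_reads = 0

    def variables(self):
        # Only the header line is read, so this stays cheap for large files.
        with open(self._path, newline="") as fh:
            header = next(csv.reader(fh), [])
        return [name for name in header if name != "time"]

    def data(self, varname):
        # The full file is parsed once, on first access, and kept in memory.
        if self._rows is None:
            with open(self._path, newline="") as fh:
                self._rows = list(csv.DictReader(fh))
            self.full_reads += 1
        return [float(row[varname]) for row in self._rows]
```

A check such as var_supported can then call variables() without triggering a full read, and the data is parsed only once when it is actually requested.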

heikoklein avatar Nov 11 '24 10:11 heikoklein

Pyaerocom 18.11.24: One solution could be to make a network-specific variables.ini file in MyPyaerocom which contains the supported variables. Then a bool could be passed saying whether to use the variables in this .ini file or to read all the data with the API. If doing the latter, update the .ini file.
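
A rough sketch of this idea, using Python's standard `configparser`. The function name `supported_variables`, the `use_ini` flag, the `fetch_variables` callable, and the file layout (one section per network with a comma-separated `variables` option) are all assumptions for illustration, not an existing pyaerocom API.

```python
import configparser


def supported_variables(ini_path, network, fetch_variables, use_ini=True):
    """Return the supported variables for a network.

    If use_ini is True and the .ini file has an entry for the network, use it.
    Otherwise call fetch_variables() (the expensive path) and update the file.
    """
    parser = configparser.ConfigParser()
    parser.read(ini_path)  # a missing file is silently ignored

    if use_ini and parser.has_option(network, "variables"):
        raw = parser.get(network, "variables")
        return [v.strip() for v in raw.split(",") if v.strip()]

    # Expensive path: ask the reader/API, then refresh the .ini file.
    variables = list(fetch_variables())
    if not parser.has_section(network):
        parser.add_section(network)
    parser.set(network, "variables", ", ".join(variables))
    with open(ini_path, "w") as fh:
        parser.write(fh)
    return variables
```

The first call for a network populates the file; later calls with `use_ini=True` avoid the expensive read, and `use_ini=False` forces a refresh.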

lewisblake avatar Nov 18 '24 10:11 lewisblake

This should be solved at the pyaro-reader implementation level. Readers like the EEA-parquet reader no longer have this problem.

heikoklein avatar Nov 19 '24 15:11 heikoklein