pyaerocom requests pyaro reader to read data twice
When using pyaro for an evaluation, I suspect that pyaro reads the data twice. I need to run more evaluations to check whether this is the case.
As Lewis mentioned in #1302, the cause is caching. Without caching, everything works as intended.
With caching: to check whether a cached file exists, the Ungriddedreader class is instantiated, which instantiates read_pyaro and thus the pyaro reader class. Since pyaro readers read their files in the constructor, the files are read immediately. Only after this does pyaerocom check for a cached file. In summary, since pyaro readers (mostly) read their files in the init method, it is difficult to use pyaro with caching without pyaro reading the data, even when cached files exist.
So the solution is either to have an explicit read() method in pyaro (no reading in init), or to continue without caching for pyaro...
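To illustrate the difference between the two patterns, here is a minimal sketch. Both classes are hypothetical stand-ins and not pyaro's actual API; the dictionary returned by the loader is made-up placeholder data.

```python
class EagerReader:
    """Hypothetical reader that loads everything in __init__ (the problematic pattern)."""

    def __init__(self, filename):
        self.filename = filename
        self.read_count = 0
        # Merely constructing the object already triggers the expensive read.
        self._data = self._load()

    def _load(self):
        self.read_count += 1
        return {"concno2": [1.0, 2.0]}  # placeholder for real file content


class LazyReader:
    """Hypothetical reader with an explicit read(); construction is cheap."""

    def __init__(self, filename):
        self.filename = filename
        self.read_count = 0
        self._data = None

    def read(self):
        # Load on first request only; later calls reuse the loaded data.
        if self._data is None:
            self.read_count += 1
            self._data = {"concno2": [1.0, 2.0]}  # placeholder for real file content
        return self._data
```

With the lazy variant, a cache check can instantiate the reader without paying for a read, and read() is only called on a cache miss.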
I will adjust pyaerocom and pyaro for a separate read() method. Especially with the parallelization, online data providers might not like it when we read the same data multiple times in parallel (e.g. ACTRIS-EBAS), so we need the caching to work. But there might be cases where it's not easy to determine whether the cache needs to be invalidated (e.g. when reading all data first is needed). In that case (again the ACTRIS-EBAS reader) I will return today's date string as the revision string for now and allow some grace time in pyaerocom to determine a cache hit.
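A rough sketch of what such a grace-time check could look like. The function name, the ISO date format (YYYY-MM-DD) for the revision string, and the 7-day default are all assumptions for illustration; none of this is existing pyaerocom code.

```python
from datetime import date, timedelta


def is_cache_hit(cached_revision: str, current_revision: str, grace_days: int = 7) -> bool:
    """Hypothetical check: accept the cache if revisions match or are close in time."""
    if cached_revision == current_revision:
        return True
    try:
        # Assumption: date-based revisions are ISO formatted, e.g. "2024-11-18".
        cached = date.fromisoformat(cached_revision)
        current = date.fromisoformat(current_revision)
    except ValueError:
        # Non-date revisions that differ are always a cache miss.
        return False
    # The cache is still valid if it is at most grace_days older than the current revision.
    return timedelta(0) <= current - cached <= timedelta(days=grace_days)
```

Readers that can report a real revision would still be compared exactly; the grace period only applies when both revision strings parse as dates.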
I'm not sure I understand the issue here.
pyaro has separated read() methods from the init-method.
If pyaerocom needs a wrapper which just contains the init-parameters to pyaro, it should be easy to fix that in pyaerocom.
@magnusuMET Are you working on this issue now?
I am redoing the pyaro->ungriddeddata so will make sure to test for this, or at least document it
Just for documentation:
I think the reason the pyaro data is read twice is the following:
In readungriddedbase there is a method called var_supported that checks against a list of supported variables of a given reader:
https://github.com/metno/pyaerocom/blob/c0a8e9d06f48fe6b2150517993bb0d6900d6fd21/pyaerocom/io/readungriddedbase.py#L378-L385
Pyaerocom's pyaro interface maps this to self.reader.variables(), which provides a list of the variables that were read and therefore reads the data:
https://github.com/metno/pyaerocom/blob/c0a8e9d06f48fe6b2150517993bb0d6900d6fd21/pyaerocom/io/pyaro/read_pyaro.py#L50-L59
Because pyaro has no internal caching, the data is then read a second time when pyaerocom actually works with the data, as that is an entirely separate call to pyaro.
A solution to the problem would be to implement a method PROVIDES_VARIABLES in pyaro that doesn't read the data.
Thanks for the analysis. I think the solution is to make the pyaro-reader developers aware of the need of an inexpensive variables() call (no need to rename it to PROVIDES_VARIABLES) like the netcdf-rw reader does here: https://github.com/metno/pyaro-readers/blob/002af7966793635fcf3ae62509ae64132a2e9e02/src/pyaro_readers/netcdf_rw/Netcdf_RWTimeseries.py#L66
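As an illustration of what an inexpensive variables() call can mean, here is a minimal sketch. The HeaderOnlyReader class, its CSV layout and the column names are hypothetical and not taken from pyaro or pyaro-readers; the point is only that the variable list comes from metadata (here the header row) rather than from a full read.

```python
import csv
import io


class HeaderOnlyReader:
    """Hypothetical reader: variables() inspects only the header row."""

    def __init__(self, text):
        self._text = text  # stands in for a file on disk
        self.full_reads = 0  # counts the expensive reads

    def variables(self):
        # Only the first row is parsed; the data rows are never touched.
        header = next(csv.reader(io.StringIO(self._text)))
        return [name for name in header if name not in ("station", "time")]

    def data(self, varname):
        # The expensive part: every row is parsed.
        self.full_reads += 1
        rows = csv.DictReader(io.StringIO(self._text))
        return [float(row[varname]) for row in rows]
```

With a reader like this, a var_supported check can call variables() as often as it likes without triggering a full read.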
Pyaerocom 18.11.24: One solution could be to create a network-specific variables.ini file in MyPyaerocom that contains the supported variables. A bool could then be passed saying whether to use the variables in this .ini file or to read all the data with the API. In the latter case, update the .ini file.
This should be solved at the pyaro-reader implementation level. Readers like the EEA-parquet reader no longer have this problem.