openaq-fetch
Missing EEA measurements (plus potential fix)
Note: a fix is proposed at the end.
It seems there was a change in how EEA publishes its up-to-date data, which might be why several measurement points are missing (also in Germany). In EEA's CSV files, the `value_datetime_inserted` field is often before the actual `value_datetime_end` field. To me, this is an issue on EEA's side: how can a measurement be inserted before it has even completed?
This seems to cause the OpenAQ fetcher to miss numerous measurements. See this line in the eea-direct adapter: https://github.com/openaq/openaq-fetch/blame/develop/adapters/eea-direct.js#L91 There it checks that the `value_datetime_inserted` field is after the last fetch time. But because this field stays constant over several measurements (and is thus outdated), the fetcher ignores many of them.
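To make the failure mode concrete, here is a small sketch of that filter's effect. The function name, row shape, and timestamps are all hypothetical; only the comparison against `value_datetime_inserted` mirrors the linked line in the adapter.

```javascript
// Hypothetical simplification of the check in adapters/eea-direct.js:
// a row survives only if its inserted time is after the cutoff.
function keepsRow(row, cutoff) {
  return new Date(row.value_datetime_inserted) > cutoff;
}

const cutoff = new Date('2019-05-01T10:00:00Z'); // pretend "two hours ago"

// Two hourly measurements that share one stale inserted timestamp:
const rows = [
  { value_datetime_end: '2019-05-01T11:00:00Z', value_datetime_inserted: '2019-05-01T02:00:00Z' },
  { value_datetime_end: '2019-05-01T12:00:00Z', value_datetime_inserted: '2019-05-01T02:00:00Z' },
];

// Both rows are recent by value_datetime_end, but both get dropped anyway.
const kept = rows.filter((r) => keepsRow(r, cutoff));
```

With a constant, stale `value_datetime_inserted`, `kept` ends up empty even though both measurements are fresh.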
An example is location DEHH059: https://openaq.org/#/location/DEHH059?parameter=no2&_k=fpzqag
The actual data from EEA contains hourly values (including the seemingly faulty `value_datetime_inserted`), yet OpenAQ ends up with only two data points per day. Attached is a sample CSV subset for the station.
I was able to observe this for many German as well as Estonian stations, which is why I think this might be related to #677.
I think this could be resolved if the fetcher checked the `value_datetime_end` field instead of the `value_datetime_inserted` field. Is there a reason not to do this?
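A minimal sketch of that proposed change, assuming the same CSV row shape as above (function and variable names are illustrative, not the adapter's actual code):

```javascript
// Proposed: compare the measurement's end time, not its inserted time,
// against the last-fetch cutoff.
function keepsRowFixed(row, cutoff) {
  return new Date(row.value_datetime_end) > cutoff;
}

const cutoffFixed = new Date('2019-05-01T10:00:00Z');

// A fresh measurement whose inserted timestamp is stale; the current
// filter would drop it, while the end-time check keeps it.
const staleInsertRow = {
  value_datetime_end: '2019-05-01T11:00:00Z',
  value_datetime_inserted: '2019-05-01T02:00:00Z',
};

const keptFixed = keepsRowFixed(staleInsertRow, cutoffFixed);
```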
@vituslehner thanks for pointing this out!
It looks like there are some instances where `value_time_inserted` is even before `value_time_start` 🤦‍♀️
Highlighted in green is the data that's being captured, yellow is what we're missing.
Like you said, we're only capturing the data points where `value_datetime_inserted` is after `value_datetime_end` and the `value_validity` flag is 1.
The fix is a bit tricky. A couple of options and why they might not work:
- Check `value_datetime_end`. We currently only look at 2 hours of data at a time, because otherwise the system would be overwhelmed by the amount of data it would have to ingest. The problem with looking at `value_datetime_end` is that there's no guarantee the value is inserted within 2 hours of `value_datetime_end` (and would thus miss being captured by the fetch). This is illustrated with Austria's data:
For the first measurement there, the value was inserted ~8 hours after `value_datetime_end`. The system would be overwhelmed if we looked for the past 8 hours of data.
- Check `value_datetime_updated`. It looks like this would work with Germany's data. However, it's unclear whether in some cases (like Austria's) the value is actually inserted at `value_datetime_inserted` and then updated after being QA/QC'ed at `value_datetime_updated`. Because we only have real-time data on the platform, and don't currently have a way to demarcate QA/QC data in the system, this would not work. I could be wrong about `value_datetime_updated`, though; I couldn't find documentation to verify this. Maybe the `value_verification` column indicates QA/QC status?
It seems like we'll have to use some combination of `value_datetime_inserted`, `value_datetime_end`, and `value_datetime_updated` for this to work across all the different ways the EEA data is inserted. I will need to do some thinking about how best to do this and get back to you. Open to suggestions!
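One purely illustrative shape such a combination could take (not a concrete proposal; the right semantics for each field still need to be pinned down): accept a row if any of its three timestamps falls after the cutoff, so a stale inserted time alone can no longer hide a fresh measurement.

```javascript
// Illustrative only: treat a row as recent if ANY of its timestamps
// (inserted, updated, or end) is newer than the cutoff.
function recentByAnyTimestamp(row, cutoff) {
  return [row.value_datetime_inserted, row.value_datetime_updated, row.value_datetime_end]
    .filter(Boolean) // some fields may be absent in a given CSV row
    .some((ts) => new Date(ts) > cutoff);
}

const comboCutoff = new Date('2019-05-01T10:00:00Z');

// Stale inserted/updated times, but a fresh end time:
const freshEndOnly = {
  value_datetime_end: '2019-05-01T11:00:00Z',
  value_datetime_inserted: '2019-05-01T02:00:00Z',
  value_datetime_updated: '2019-05-01T02:00:00Z',
};

const accepted = recentByAnyTimestamp(freshEndOnly, comboCutoff);
```

The trade-off noted above still applies: widening the window this way pulls in more rows per fetch, so some cap on lookback would still be needed.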
Just wanted to add that I found the documentation, and the `value_verification` flag is defined as follows:
1 - Data has gone through full quality assurance and quality control by data provider/owner. Data is considered verified and can be formally used for its reporting purposes.
2 - Data has gone through initial checks by data provider/owner. Extra quality assurance and quality control verifications have been carried out compared to "Not verified" data. Data verification might be carried out on an ongoing basis and is nominally a process to "clean up" the initial "Not verified" data. Any corrections made during the verification process change the status flag to "Preliminary verified". Data with this flag is considered provisional and can be used as such. This flag is independent of the validity status.
3 - Data has not gone through (or has gone through minimal) checks by data provider/owner. Some basic screening criteria may be used in order to exclude clearly faulty data as far as possible. Data with this flag is considered provisional and can be used as such. This flag is independent of the validity status.
By getting data based on `value_datetime_inserted`, most of it probably falls into category 3. If we use `value_datetime_updated`, we'll have to make sure we don't get fully QA/QC'ed data.
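A small sketch of that guard, assuming `value_verification` uses the codes quoted above (1 = verified, 2 = preliminary verified, 3 = not verified); the function name is hypothetical:

```javascript
// EEA value_verification codes (per the documentation quoted above):
// 1 = verified, 2 = preliminary verified, 3 = not verified.
const VERIFIED = 1;

// If the fetcher switched to value_datetime_updated, it could still skip
// fully QA/QC'ed rows so only provisional/real-time data is ingested.
function isRealtimeCandidate(row) {
  return Number(row.value_verification) !== VERIFIED;
}

const picksUnverified = isRealtimeCandidate({ value_verification: '3' }); // true
const picksVerified = isRealtimeCandidate({ value_verification: '1' });   // false
```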
@majesticio (from #917 )
I could be wrong, but I don't think this was ever the issue. The `timeLastInserted` value used in the filter mentioned above is derived from the current time, not from that spreadsheet. So it was always just filtering for values that were inserted in the last two hours. I have not dug into the commits for that file to see if that was ever different, so maybe I'm wrong.
But if you look at the most recent data available (https://discomap.eea.europa.eu/map/fme/latest/), it looks like all of the data that falls in that 2-hour window is missing. There is a placeholder for it, but the first record that actually has a value (marked as 1 for validation) has an insert time of about 8 hours ago.
Keep in mind this issue predates the current ingestion method, so there may be parts of this that the new ingestion handles better. We might be talking about two different realities given the vintage of this issue.
The issue is in the fetcher, not the ingestion. The fetcher is not even fetching data to pass to the ingester.
I suggest we set it up to pass an actual value for `timeLastInserted`, which could be maintained in the db for each provider and/or source. Then we ignore the inserted time in the CSV and just use the end time, making sure we are not pulling down data from before that end time. It looks like the updated time keeps updating, but I assume the actual value is not being updated after the end time.
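The suggestion above can be sketched as a per-source watermark; everything here is hypothetical (an in-memory `Map` stands in for the db table, and the function names are made up), but it shows the intended behavior: only rows whose `value_datetime_end` is newer than the stored watermark are ingested, and the watermark advances with each fetch.

```javascript
// Stand-in for a db table holding timeLastInserted per provider/source.
const watermarks = new Map();

// Accept only rows whose end time is after the stored watermark for this
// source, ignoring the CSV's inserted time entirely; then advance the mark.
function newRowsForSource(source, rows) {
  const mark = watermarks.get(source) || new Date(0);
  const fresh = rows.filter((r) => new Date(r.value_datetime_end) > mark);
  if (fresh.length > 0) {
    const newest = fresh
      .map((r) => new Date(r.value_datetime_end))
      .reduce((a, b) => (a > b ? a : b));
    watermarks.set(source, newest);
  }
  return fresh;
}

// First fetch ingests both rows and moves the watermark to 12:00.
const batch1 = newRowsForSource('DEHH059', [
  { value_datetime_end: '2019-05-01T11:00:00Z' },
  { value_datetime_end: '2019-05-01T12:00:00Z' },
]);

// Re-fetching already-seen rows yields nothing new.
const batch2 = newRowsForSource('DEHH059', [
  { value_datetime_end: '2019-05-01T12:00:00Z' },
]);
```

This makes re-fetches idempotent regardless of how EEA sets `value_datetime_inserted`, which is the crux of the original report.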