
Missing EEA measurements (plus potential fix)

Open vituslehner opened this issue 4 years ago • 6 comments

Note: Fix proposed at the end.

It seems like there was a change in how EEA publishes up-to-date data, which might be why several measurement points are missing (including in Germany). In EEA's CSV files, the value_datetime_inserted field often appears to be before the value_datetime_end field. To me, this is an issue on EEA's side: how can a measurement be inserted before it has completed?

This issue seems to cause the OpenAQ fetcher to miss numerous measurements. See this line in the eea-direct adapter: https://github.com/openaq/openaq-fetch/blame/develop/adapters/eea-direct.js#L91 It checks that the value_datetime_inserted field is after the last fetch time. But because this field stays constant across several measurements (which therefore look outdated), the fetcher ignores many of them.
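The check described above can be sketched roughly as follows (a simplified reconstruction for illustration, not the actual adapter code; the `rows` shape and `timeLastFetch` parameter are assumptions):

```javascript
// Simplified sketch of the filtering behavior described above.
// `rows` are parsed CSV records; field names follow the EEA CSV headers.
function filterNewMeasurements(rows, timeLastFetch) {
  return rows.filter((row) => {
    const inserted = new Date(row.value_datetime_inserted);
    // Rows whose inserted timestamp predates the last fetch are skipped,
    // even if value_datetime_end is recent -- this is the reported problem.
    return inserted > timeLastFetch;
  });
}
```

With this logic, an hourly measurement whose value_datetime_inserted never advances is dropped on every fetch after the first, no matter how recent its value_datetime_end is.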

Example is location DEHH059: https://openaq.org/#/location/DEHH059?parameter=no2&_k=fpzqag

The actual data from EEA contains hourly values (including the seemingly faulty value_datetime_inserted), yet OpenAQ ends up with only two data points per day. Attached is a sample CSV subset for the station.

I was able to observe this for many German as well as Estonian stations. That's why I think the issue might be related to #677.

I think this could be resolved if the fetcher checks for the value_datetime_end field instead of the value_datetime_inserted field. Might there be a reason to not do this?

vituslehner avatar Mar 05 '20 13:03 vituslehner

@vituslehner thanks for pointing this out!

It looks like there are some instances where value_datetime_inserted is even before value_datetime_start 🤦‍♀
[Screenshot of EEA CSV data, 2020-03-20]

Highlighted in green is the data that's being captured, yellow is what we're missing.

Like you said, we're only capturing the data point where value_datetime_inserted is after value_datetime_end and the value_validity flag is 1.

The fix is a bit tricky. A couple of options and why they might not work:

  1. Check for value_datetime_end. We currently only look at 2 hours of data at a time because otherwise the system would be overwhelmed by the amount of data it would have to ingest. The problem with looking at value_datetime_end is that there's no guarantee a value is inserted within 2 hours of its value_datetime_end (so it would miss being captured by the fetch). This is illustrated with Austria's data: [screenshot of Austrian EEA CSV data, 2020-03-20]

For that first measurement, it was inserted ~8 hours after value_datetime_end. The system would be overwhelmed if we looked for the past 8 hours of data.

  2. Check for value_datetime_updated. It looks like it would work with Germany's data. However, it's unclear whether in some cases (like Austria's) the value is actually inserted at value_datetime_inserted and then updated, after being QA/QC'ed, at value_datetime_updated. Because we only have real-time data on the platform, and don't currently have a way to demarcate QA/QC data in the system, this would not work. I could be wrong about value_datetime_updated, though; I couldn't find documentation to verify this. Maybe the value_verification column indicates QA/QC status?

It seems like we'll have to use some combination of value_datetime_inserted, value_datetime_end and value_datetime_updated for it to work for all the different ways the EEA data is inserted. I will need to do some thinking about how best to do this and get back to you. Open to suggestions!
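One shape such a combination could take (purely illustrative, not a proposal from the thread; the two-hour window and the row field names are assumptions based on the discussion above):

```javascript
// Illustrative sketch: accept a row if ANY of the three timestamps falls
// inside the recent fetch window, so measurements inserted or updated late
// are still captured. Not the adapter's actual logic.
function isFreshMeasurement(row, now, windowMs = 2 * 60 * 60 * 1000) {
  const cutoff = new Date(now.getTime() - windowMs);
  const timestamps = [
    row.value_datetime_end,
    row.value_datetime_inserted,
    row.value_datetime_updated,
  ];
  return timestamps.some((t) => t && new Date(t) > cutoff);
}
```

The trade-off is that a disjunction like this can re-admit rows seen on a previous fetch, so it would need to be paired with deduplication downstream.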

sruti avatar Mar 20 '20 23:03 sruti

Just wanted to add that I found the documentation and the value_verification flag is defined as follows:

1 - Data has gone through full quality assurance and quality control by data provider/owner. Data is considered verified and can be formally used for its reporting purposes.

2 - Data has gone through initial checks by data provider/owner. Extra quality assurance and quality control verifications have been carried out compared to "Not verified" data. Data verification might be carried out on an ongoing basis and is nominally a process to "clean up" the initial "Not verified" data. Any corrections made during the verification process change the status flag to "Preliminary verified". Data with this flag is considered provisional and can be used as provisional. This flag is independent of the validity status.

3 - Data has not gone through (or has gone through minimal) checks by data provider/owner. Some basic screening criteria may be used to exclude clearly faulty data as far as possible. Data with this flag is considered provisional and can be used as provisional. This flag is independent of the validity status.

By getting data based on value_datetime_inserted most of it probably falls into category 3. If we use value_datetime_updated, we'll have to make sure we don't get fully QA/QC'ed data.
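If the flag works as documented above, excluding fully verified data could be as simple as this (a sketch under that assumption, treating flags 2 and 3 as provisional per the quoted definitions):

```javascript
// Sketch: keep only provisional rows, based on the value_verification
// flag definitions quoted above (1 = verified, 2/3 = provisional).
function isProvisional(row) {
  const flag = Number(row.value_verification);
  return flag === 2 || flag === 3;
}
```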

sruti avatar Mar 30 '20 08:03 sruti

@majesticio (from #917 )

I could be wrong, but I don't think this was ever the issue. The timeLastInserted value used in the filter mentioned above is derived from the current time, not from that spreadsheet. So it was always just keeping values that were inserted in the last two hours. I have not dug into the commits for that file to see if that was ever different, so maybe I'm wrong.

But if you look at the most recent data available (https://discomap.eea.europa.eu/map/fme/latest/), it looks like all of the data that falls in that 2-hour window is missing. There is a placeholder for it, but the first record that actually has a value (marked as 1 for validation) has an insert time of about 8 hours ago.

caparker avatar Feb 03 '23 21:02 caparker

Keep in mind this issue predates the current ingestion method, so there may be parts of this that ingestion now handles better. We might be talking about two different realities given the vintage of this issue.

russbiggs avatar Feb 03 '23 21:02 russbiggs

The issue is in the fetcher, not the ingestion. The fetcher is not even fetching data to pass to the ingester.

caparker avatar Feb 03 '23 21:02 caparker

I suggest that we set it up to pass an actual value for timeLastInserted, which could be maintained in the db for each provider and/or source. Then we ignore the inserted time in the CSV and just use the end time, making sure we are not pulling down data from before that end time. It looks like the updated time keeps updating, but I assume the actual value is not being updated after the end time.
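A minimal sketch of that suggestion, keeping a per-source watermark and comparing only against value_datetime_end (illustrative; how the watermark is loaded from and saved to the db is left out, and the function name is hypothetical):

```javascript
// Sketch: filter rows against a db-maintained per-source watermark and
// advance the watermark to the latest end time ingested, ignoring the
// CSV's inserted time entirely.
function selectNewRows(rows, watermark) {
  const fresh = rows.filter(
    (row) => new Date(row.value_datetime_end) > watermark
  );
  const newWatermark = fresh.reduce((max, row) => {
    const end = new Date(row.value_datetime_end);
    return end > max ? end : max;
  }, watermark);
  return { fresh, newWatermark };
}
```

The caller would persist `newWatermark` for the source after a successful ingest, so late-arriving rows are picked up on whichever fetch they finally appear in.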

caparker avatar Feb 03 '23 21:02 caparker