openaq-data-format
Adding in field(s) that reflect RT vs historical/backfilled + QA/QC (for non-RT data)
Suggest a 'Data Type' Field with four categories, such as:
- Real-Time: Any data we currently ingest into the system, which by definition has not gone through QA/QC
- Historical/QA-QC: Backfilled historical data that has gone through QA/QC (e.g. EEA or EPA non-RT data, possibly from researchers)
- Historical/No QA-QC: 'Raw data'
- Historical/Unknown
Or perhaps this is too complicated and it should be broken down into two fields: one for RT vs. Historical and the other for QA/QC: Yes/No/Unknown?
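To make the two options easier to compare, here is a rough Python sketch of both shapes. All class, member, and function names are hypothetical and not part of the current format. It only illustrates how the four proposed categories would map onto a two-field scheme.

```python
from enum import Enum
from typing import NamedTuple


class DataType(Enum):
    """Option A: a single field with the four proposed categories."""
    REAL_TIME = "real-time"
    HISTORICAL_QA_QC = "historical/qa-qc"
    HISTORICAL_NO_QA_QC = "historical/no-qa-qc"
    HISTORICAL_UNKNOWN = "historical/unknown"


class Timeliness(Enum):
    REAL_TIME = "real-time"
    HISTORICAL = "historical"


class QaQc(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"


class TwoFields(NamedTuple):
    """Option B: two independent fields."""
    timeliness: Timeliness
    qa_qc: QaQc


def to_two_fields(data_type: DataType) -> TwoFields:
    """Map an Option A category onto the Option B pair of fields."""
    mapping = {
        # Real-time data is, by definition here, not QA/QC'ed.
        DataType.REAL_TIME: TwoFields(Timeliness.REAL_TIME, QaQc.NO),
        DataType.HISTORICAL_QA_QC: TwoFields(Timeliness.HISTORICAL, QaQc.YES),
        DataType.HISTORICAL_NO_QA_QC: TwoFields(Timeliness.HISTORICAL, QaQc.NO),
        DataType.HISTORICAL_UNKNOWN: TwoFields(Timeliness.HISTORICAL, QaQc.UNKNOWN),
    }
    return mapping[data_type]
```

One thing the sketch makes visible: the two-field version allows six combinations, including real-time data marked as QA/QC'ed or unknown, which the single four-category field cannot express.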
The motivation for this suggested change: eventually, we will want to be able to backfill data from sources, fill in holes, or take data from sources (e.g. gov't agencies, researchers) that would rather share only QA/QC'ed data. Data that is not real-time, especially from gov't sources, will have been QA/QC'ed, unlike the real-time data we are collecting. For this reason, it would be good to have a field that reflects these differences in known data quality. We have gotten requests for this feature.
We have also gotten a related request to provide info on the exact QA/QC procedures of a given place. That'd be awesome, but I think it will be difficult to parse precisely the QA/QC controls used by each place, and unreasonable for us to attempt at this time or in the near future. Plus, a user can look up the data source agency and contact it for more information.
cc: @olafveerman @dolugen @jflasher - I'll be making a series of these for discussion (and using a new label, dark blue 'v2'.) Will be interested in your thoughts on these and other possible changes to the format for v2.
I definitely see the value of having a verified: true / false field.
Other questions/comments:
- Assuming that we incorporate something like the verified field, what is the use case for marking something as historical? In a certain sense, a scraped real-time value is as historic as one that was added through bulk upload.
- For back-filling data, I think we should add an indication of the upload method. Either something simple like bulk_upload: true, or something more complete like:
"bulk_upload": {
  "note": "info about the source",
  "date": "2015-02-12"
}
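As a small illustration of how the more complete form could be produced, here is a Python sketch. The function name and the `uploaded_on` parameter are hypothetical, and I am assuming the date refers to the day of the upload, which this thread does not actually specify.

```python
import json
from datetime import date


def bulk_upload_metadata(note: str, uploaded_on: date) -> dict:
    """Build the proposed 'bulk_upload' object (hypothetical helper)."""
    return {
        "bulk_upload": {
            "note": note,
            # JSON has no date type, so the date is stored as an
            # ISO 8601 string (assumed here to be the upload date).
            "date": uploaded_on.isoformat(),
        }
    }


metadata = bulk_upload_metadata("info about the source", date(2015, 2, 12))
print(json.dumps(metadata, indent=2))
```

Keeping the date as a string means the object survives a round trip through any JSON parser unchanged.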
Apart from the data standard side, it will be interesting to see how we can reconcile this data. Should we toss out the unverified measurements for the same timestamp? How do we do that when timestamps may not be the same?
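One possible way to approach the reconciliation question, sketched in Python purely for discussion: treat an unverified measurement as superseded when a verified one for the same location and parameter falls within some tolerance window. The `Measurement` type, the `reconcile` function, and the 30-minute default are all assumptions made for this example, not a proposal for the actual rule.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Measurement(NamedTuple):
    """Hypothetical, simplified measurement record."""
    location: str
    parameter: str
    timestamp: datetime
    value: float
    verified: bool


def reconcile(
    measurements: List[Measurement],
    tolerance: timedelta = timedelta(minutes=30),  # assumed window
) -> List[Measurement]:
    """Drop unverified measurements that a verified one supersedes.

    An unverified measurement is dropped when a verified measurement for
    the same location and parameter lies within `tolerance` of it.
    Input order is preserved for everything that is kept.
    """
    verified = [m for m in measurements if m.verified]
    kept = []
    for m in measurements:
        if m.verified:
            kept.append(m)
            continue
        superseded = any(
            v.location == m.location
            and v.parameter == m.parameter
            and abs(v.timestamp - m.timestamp) <= tolerance
            for v in verified
        )
        if not superseded:
            kept.append(m)
    return kept
```

The tolerance handles timestamps that do not line up exactly, but choosing it is a judgment call: too wide and valid real-time values get discarded, too narrow and duplicates survive.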