openaq-fetch

New data: Italy

Open espenairmine opened this issue 5 years ago • 5 comments

Find the right sources, use them, and then close #670 #449 #420 #415 #303

Remove duplicates - #710

espenairmine avatar Apr 21 '20 11:04 espenairmine

@magsyg

espenairmine avatar Apr 21 '20 11:04 espenairmine

Italy is currently sourced from Arpa for some regions only, so a lot of data is missing. EEA covers almost all of Italy and includes all local (Arpa) sources except Sicilia. I propose adding EEA as a filler source for Italy, and adding more local Arpa sources later, when we have time, for more precise data collection. There are 3 PRs for local Arpa sources for Italy: #721, #720, #366, #716, which we should use as a first approach. The current sources should be kept, because EEA can be a bit unstable at times.

Where there is overlap, the sources will be filtered according to the following process:
- Current sources: keep all locations
- EEA: remove locations currently loaded from Arpa

When adding additional Arpa sources, the EEA filter list needs to be updated, meaning the Arpa sources take priority.
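The filtering process above could be sketched roughly as follows. This is only an illustration: the `{ id, source }` location shape, the `mergeWithArpaPriority` name, and the idea that Arpa and EEA locations share a comparable `id` are all assumptions, not how openaq-fetch actually represents or matches locations.

```javascript
// Hypothetical sketch: Arpa locations always win, EEA only fills the gaps.
// Assumes each location is { id, source } and that ids are comparable
// across sources (an assumption made for illustration only).
function mergeWithArpaPriority(arpaLocations, eeaLocations) {
  // The "EEA filter list": every location already loaded from Arpa.
  const filterList = new Set(arpaLocations.map((loc) => loc.id));

  // Drop EEA locations that Arpa already provides.
  const eeaKept = eeaLocations.filter((loc) => !filterList.has(loc.id));

  // Keep all current (Arpa) locations, then append the EEA fillers.
  return [...arpaLocations, ...eeaKept];
}
```

Adding a new Arpa adapter would then only grow the filter list; the EEA side needs no other change.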

magsyg avatar Apr 29 '20 11:04 magsyg

Thanks for linking up all the remaining issues, and for looking into the various Italy adapters! I'm thinking we go with a different approach.

Using the script I detailed in #710, I dug into which stations are likely to be duplicates and if we would lose any stations by disabling current adapters and switching to EEA (given the battuta issue is fixed, otherwise it's ~550 new stations):

New EEA Italy stations: 772 
Existing Italy stations: 104
Inactive stations: 6
Similar coordinates (diffThreshold: 0.00001): 14
[
  { new: '41.768188999999985,12.237048000000001', existing: '41.76819,12.23705' },
  { new: '41.73,13.338330000000001', existing: '41.73,13.33833' },
  { new: '41.77484900000001,12.223413', existing: '41.77485,12.22341' },
  { new: '42.137339999999995,11.79316', existing: '42.13734,11.79316' },
  { new: '42.159949999999995,11.74263', existing: '42.15995,11.74263' },
  { new: '42.102159999999984,11.784360000000001', existing: '42.10216,11.78436' },
  { new: '42.0989,11.81769', existing: '42.0989,11.81769' },
  { new: '42.081824999999995,11.809336', existing: '42.08183,11.80934' },
  { new: '42.07361,11.81592', existing: '42.07361,11.81592' },
  { new: '42.16096999999999,11.90002', existing: '42.16097,11.90002' },
  { new: '42.15223,11.93583', existing: '42.15223,11.93583' },
  { new: '42.26856,11.91091', existing: '42.26856,11.91091' },
  { new: '42.09704999999999,11.788350000000001', existing: '42.09705,11.78835' },
  { new: '42.086802999999996,11.806498000000001', existing: '42.0868,11.8065' }
]
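For readers without #710 at hand, a comparison of this kind might look like the sketch below. It is not the actual script: the function names are made up, and it assumes "similar" means that latitude and longitude each differ by less than `diffThreshold`, which is consistent with the pairs listed above but not confirmed.

```javascript
// Hypothetical re-creation of the similarity check; the real script is in #710.
// Coordinates are "lat,lon" strings, as in the output above.
function parseCoords(str) {
  const [lat, lon] = str.split(',').map(Number);
  return { lat, lon };
}

// Returns every { new, existing } pair whose latitude AND longitude
// each differ by less than diffThreshold (assumed definition of "similar").
function findSimilarCoords(newStations, existingStations, diffThreshold) {
  const matches = [];
  for (const n of newStations) {
    const a = parseCoords(n);
    for (const e of existingStations) {
      const b = parseCoords(e);
      const latClose = Math.abs(a.lat - b.lat) < diffThreshold;
      const lonClose = Math.abs(a.lon - b.lon) < diffThreshold;
      if (latClose && lonClose) {
        matches.push({ new: n, existing: e });
      }
    }
  }
  return matches;
}
```

Calling `findSimilarCoords(newCoords, existingCoords, 0.00001)` with the two coordinate lists would produce output in the shape shown above.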

If we increase the diffThreshold to 0.001, the number of stations with similar coordinates increases to 80 (didn't list all of them here). Looking at the numbers, it seems likely most of these are the same station and would be grouped together by the unique ID:

  { new: '44.842499999999994,11.61306', existing: '44.8425,11.6131' },
  { new: '44.82389,9.830279999999998', existing: '44.8239,9.8304' },
  { new: '44.63604999999999,10.90473', existing: '44.637,10.9057' },
  { new: '41.94749999999999,12.46972', existing: '41.94745,12.46959' },
  { new: '41.88306,12.508890000000001', existing: '41.88306,12.50894' },
  { new: '42.42194,12.10917', existing: '42.42206,12.10913' },
  { new: '41.595278,12.653611', existing: '41.59534,12.65358' },
  { new: '42.40417,12.85833', existing: '42.40409,12.85822' },
  { new: '41.46388900000001,12.913056', existing: '41.46402,12.91304' },
  { new: '41.75,13.149721999999999', existing: '41.75,13.14968' },
  { new: '41.768188999999985,12.237048000000001', existing: '41.76819,12.23705' },
  { new: '41.57,13.33722', existing: '41.57,13.33719' },
  { new: '42.157778,11.908611000000002', existing: '42.15774,11.90874' },
  { new: '42.091666999999994,11.8025', existing: '42.09163,11.80247' },
  { new: '42.091667,11.8025', existing: '42.09163,11.80247' },
  { new: '41.99555999999999,12.72639', existing: '41.99568,12.72637' },
  { new: '41.730833,13.004444', existing: '41.73084,13.00435' },
  { new: '41.725,13.009444000000002', existing: '41.72501,13.00957' },
  { new: '44.48333,11.355000000000002', existing: '44.4836,11.355' },
  { new: '44.42861,12.18667', existing: '44.4278,12.1865' },
  { new: '44.51611099999999,10.733889', existing: '44.5162,10.7339' },
  { new: '41.88944399999999,12.266389', existing: '41.88944,12.2663' },
  { new: '41.93277799999999,12.506944', existing: '41.93287,12.50697' },
  { new: '41.85777799999999,12.568611000000002', existing: '41.85772,12.56866' },
  { new: '42.57249999999999,12.961944', existing: '42.57259,12.96198' },

To double check, I mapped all the coordinates to see where there aren't overlaps and it looks like EEA covers pretty much all. Red - existing stations, Green - new EEA stations, purple - inactive stations:

[Screenshots: three map views, 2020-05-01]

sruti avatar May 01 '20 21:05 sruti

Based on that, I would say let's disable current adapters, and add Italy through EEA. And then add in local sources if there are gaps. For Italy at least, EEA is more reliable than the current sources/adapters and it's easier to manage the 1 source instead of multiple adapters.

sruti avatar May 01 '20 21:05 sruti

@sruti - Thanks for the good feedback!

Can you clarify ".. would be grouped together by the unique ID"? Does that mean that there exists a unique ID <=> lat/lon relationship?

What happens if sources A and B give a measurement for the same lat/lon + time (assuming A updates first, then B)?
1. Update B will be discarded
2. Update B will override the value from A
3. There will be two observations in the DB, having the same lat/lon/time

The answer to the above will have implications for how we treat multiple sources for the same country with overlapping data.
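To make the three options concrete, here is a small sketch of each outcome. It uses a hypothetical in-memory store keyed by lat/lon/time and made-up policy names; it does not claim to show which behavior openaq-fetch or its database actually has.

```javascript
// Hypothetical illustration of the three outcomes; not openaq-fetch's real storage.
// Each update is { source, lat, lon, time, value }.
// policy is one of 'discard' (option 1), 'override' (option 2), 'keep-both' (option 3).
function ingestUpdates(updates, policy) {
  const store = new Map(); // key: "lat,lon|time" -> array of stored observations
  for (const u of updates) {
    const key = `${u.lat},${u.lon}|${u.time}`;
    const existing = store.get(key) || [];
    if (existing.length === 0 || policy === 'keep-both') {
      // First observation for this key, or option 3: store alongside any others.
      store.set(key, [...existing, u]);
    } else if (policy === 'override') {
      // Option 2: the later update replaces the earlier one.
      store.set(key, [u]);
    }
    // Option 1 ('discard'): the earlier observation stays, the later one is dropped.
  }
  return [...store.values()].flat();
}
```

With A arriving before B for the same lat/lon/time, `'discard'` leaves only A, `'override'` leaves only B, and `'keep-both'` leaves two rows.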

espenairmine avatar May 04 '20 09:05 espenairmine