
Filtering out duplicates

Open magsyg opened this issue 5 years ago • 6 comments

In countries with multiple sources there is often a lot of overlap between stations. The same station reported by different sources often has slightly different geolocations, so even though they are the same station they get registered as two different stations.

Should there be a method to filter out the duplicates?

I propose to:

building a filter from https://api.openaq.org/v1/locations?country=COUNTRYCODE, where each location and its parameters are added to a filter array with latitude and longitude rounded to 4 or 5 decimal places, and then removing or ignoring measurements from overlapping sources that share the same rounded geolocation.

Is this approach desired?
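A minimal sketch of the rounding-based filter described above (the helper names and the 5-decimal precision are illustrative assumptions, not an agreed design):

```javascript
// Build a lookup key from coordinates rounded to 5 decimal places (~1 m).
function coordKey(latitude, longitude) {
    return `${latitude.toFixed(5)},${longitude.toFixed(5)}`
}

// Given known locations (e.g. from /v1/locations?country=XX),
// build a set of rounded-coordinate keys to filter against.
function buildFilter(locations) {
    return new Set(locations.map(l => coordKey(l.latitude, l.longitude)))
}

// Drop incoming measurements whose rounded coordinates are already known.
function filterDuplicates(measurements, filter) {
    return measurements.filter(m => !filter.has(coordKey(m.latitude, m.longitude)))
}
```

With this shape, two stations whose coordinates differ only past the 5th decimal collapse to the same key and the duplicate is dropped.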

magsyg avatar Apr 21 '20 09:04 magsyg

I think there are two pieces of this: the fetch side and the API side (they're decoupled via the database). I think it may be better to handle this on the fetch side ultimately.

Right now, the unique index on the database is location, city, parameter, date (https://github.com/openaq/openaq-api/blob/develop/migrations/20160109181033_initial.js#L18). I'm wondering if we should change that to be something like round(lat, 5), round(lon, 5), parameter, date. We're using PostGIS, so there may be a nicer way to handle this location component.

As part of this, I think we'd need to say we're only accepting measurements that have coordinates to 5 decimal degrees. I am not sure how much data this would cut out based on what we're currently receiving. I also don't know what it'd take (time-wise) to create this new unique index with the number of points that are in the database. That is to say, the database may be down for a bit if this can't be done in the background.

But if we did it that way, I believe this would take care of any duplicate measurements being inserted in the first place and avoid the need to handle anything on the API side?
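The collision rule such an index would enforce can be sketched in application code (this is an illustration of the proposed semantics, not the actual migration; the SQL in the comment is one possible form, using an expression index and Postgres's `CREATE INDEX CONCURRENTLY` to avoid blocking writes):

```javascript
// Mirror of the proposed unique index semantics:
//   (round(lat, 5), round(lon, 5), parameter, date)
// In Postgres this might be created without locking the table as, e.g.:
//   CREATE UNIQUE INDEX CONCURRENTLY measurements_dedup_idx
//     ON measurements (round(lat::numeric, 5), round(lon::numeric, 5), parameter, date_utc);
function measurementKey(m) {
    return [m.latitude.toFixed(5), m.longitude.toFixed(5), m.parameter, m.date].join('|')
}

// Keep only the first measurement seen for each key.
function dedupe(measurements) {
    const seen = new Set()
    return measurements.filter(m => {
        const key = measurementKey(m)
        if (seen.has(key)) return false
        seen.add(key)
        return true
    })
}
```

Two measurements whose coordinates agree to 5 decimal places, with the same parameter and date, collide on the key and only the first is kept.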

jflasher avatar Apr 23 '20 21:04 jflasher

How long will this probably take? The EEA sources for Italy and Poland contain a lot of extra duplicates, and while the updates to the database are in progress, I'm thinking of adding a makeshift filtering method to avoid these duplicates, as proposed in #712.

magsyg avatar Apr 29 '20 12:04 magsyg

Spent some time digging into this. To better understand the overlap between existing stations and incoming stations if we add in EEA data, I wrote a quick (very unoptimized) script. I wanted to know: 1) how many stations are likely duplicated, 2) whether the EEA leaves out any existing stations, and 3) whether the stations left out by the EEA are inactive.

const axios = require('axios')

async function main(country, diffThreshold) {
    try {

        //Get new stations from EEA (battuta file)
        let response = await axios.get('http://battuta.s3.amazonaws.com/eea-stations-all.json')
        let coordinatesEEA = response.data.map(location => {
            if (location.stationId.slice(0, 2) === country) {
                return { latitude: location.latitude, longitude: location.longitude }
            }
        })
        coordinatesEEA = coordinatesEEA.filter(location => location != undefined)
        console.log(`New EEA stations: ${coordinatesEEA.length}`)
        coordinatesEEA.forEach(location => console.log(`${location.latitude},${location.longitude}`))

        //Get existing stations from platform
        response = await axios.get(`https://api.openaq.org/v1/locations?country=${country}&limit=200`)
        let coordinatesExisting = response.data.results.map(location => {
            return { latitude: location.coordinates.latitude, longitude: location.coordinates.longitude }
        })
        console.log(`Existing stations: ${coordinatesExisting.length}`)
        coordinatesExisting.forEach(location => console.log(`${location.latitude},${location.longitude}`))

        //Get list of coordinates that are similar (likely to be same station)
        let similarCoords = coordinatesEEA.flatMap(newLocation =>
            coordinatesExisting.map(location =>
                compareCoordinates(newLocation, location, diffThreshold)
                    ? { new: `${newLocation.latitude},${newLocation.longitude}`, existing: `${location.latitude},${location.longitude}` } : null
            ).filter(coords => (coords)))
        console.log(`Similar coords: ${similarCoords.length}`)
        console.log(similarCoords)

        //Get list of existing stations that haven't been updated recently
        const stationsInactive = response.data.results.filter(location => location.lastUpdated.slice(0, 4) != '2020')
        console.log(`Inactive stations: ${stationsInactive.length}`)
        stationsInactive.forEach(location => console.log(`${location.coordinates.latitude},${location.coordinates.longitude}`))

    }
    catch (error) {
        console.log(error)
    }
}

function compareCoordinates(location1, location2, diffThreshold) {
    return (Math.abs(location1.longitude - location2.longitude) < diffThreshold
        && Math.abs(location1.latitude - location2.latitude) < diffThreshold)
}

It prints out the coordinates for each set and the likely overlapping stations based on the diffThreshold input, where 0.1 is roughly 10 km and 0.00001 is roughly 1 m.
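As a rough sanity check on those thresholds: one degree of latitude is about 111 km, so a threshold in decimal degrees converts to metres roughly as below (longitude spacing shrinks with cos(latitude), which this approximation ignores):

```javascript
// Approximate metres per degree of latitude.
// Longitude spacing varies with cos(latitude) and is ignored here.
const METERS_PER_DEGREE = 111000

function thresholdToMeters(diffThreshold) {
    return diffThreshold * METERS_PER_DEGREE
}
```

So a diffThreshold of 0.00001 corresponds to about 1.1 m, and 0.1 to about 11 km.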

Ultimately, I used the very unscientific but effective method of plotting the different sets of coordinates on a map to see overlaps and outliers. I tried this out for Italy and added my observations in #712.

sruti avatar May 01 '20 21:05 sruti

There's an additional problem here which is due to this issue. Basically, our reference file for EEA stations is out of date. To check for duplicates again with the real number of EEA stations, I used this script:

const parse = require('csv-parse/lib/sync')
const fs = require('fs')

function parseEEAMetadata(country) {
    let data
    let seen = new Set()
    try {
        // Station metadata from the EEA, tab-separated
        const content = fs.readFileSync('PanEuropean_metadata.csv')
        // Drop the header row and the trailing empty record
        data = parse(content, { delimiter: '\t' }).slice(1, -1)
        const coordinatesEEA = data.filter(location => location[0] === country)
            // Column 15 is latitude, column 14 is longitude
            .map(location => { return { latitude: location[15], longitude: location[14] } })
            // Deduplicate, keyed on latitude alone
            .filter(location => {
                const duplicate = seen.has(location.latitude)
                seen.add(location.latitude)
                return !duplicate
            })
        return coordinatesEEA
    } catch (e) {
        console.log('could not parse: ', e)
    }
}

Planning to prioritize fixing this issue as it affects all existing EEA sources.

sruti avatar May 01 '20 21:05 sruti

All this to say, I think it's best to stick to national-level sources like the EEA where possible. It's less to manage, and we'll have fewer worries about duplicate data. At this point I don't think it's worth investing time in the filtering method, because it applies to very few sources and seems like a big undertaking. It also doesn't make sense to pull in a lot of data from different sources only to have half of it filtered out. Better to optimize the sources themselves! Definitely something to keep in mind as we continue to expand the platform, though.

sruti avatar May 01 '20 21:05 sruti

I think checking for duplicates with coordinates could be dicey and may have unintended consequences or bugs. IMO the best method would be to build a set of unique identifiers, like serial numbers or device IDs, to filter out duplicate monitors.
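A sketch of that ID-based approach, assuming each incoming station record carries some stable identifier (the field name `stationId` is hypothetical; real sources may expose a serial number, device ID, or source-specific code instead):

```javascript
// Deduplicate monitors by a stable unique identifier rather than coordinates.
// The field name `stationId` is hypothetical.
function dedupeById(stations) {
    const seen = new Set()
    return stations.filter(s => {
        if (seen.has(s.stationId)) return false
        seen.add(s.stationId)
        return true
    })
}
```

Unlike coordinate rounding, this is immune to geolocation jitter, but it only works where the different sources actually report the same identifier for the same monitor.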

majesticio avatar Feb 03 '23 17:02 majesticio