gabbar
Prototype an anomaly detection model for highways
Ref: https://github.com/mapbox/gabbar/issues/80 and https://github.com/mapbox/gabbar/issues/69
We all know labelled data is gold in machine learning land. But in the context of OpenStreetMap and osmcha, there are two things to keep in mind:
1. Labelled harmful highways
On osmcha, labelling happens at the changeset level: a changeset is either good or harmful. But there are scenarios where not all features of a harmful changeset are harmful, so we should not assume that every feature of a harmful changeset is harmful. In Gabbar, we worked around this by using only changesets where a single feature was touched. If the changeset was good, its only feature was good, and if the changeset was harmful, its only feature was harmful.
This worked OK for a generic classifier, but for the highway classifier the dataset is too small. For example, the latest highway classifier was trained on 2217 good highways and a mere 55 harmful highways. Yes, the number of harmful highways is low. This means supervised learning algorithms might not be fed enough to grow strong and healthy.
2. Labelled good highways
But we have a (comparative) abundance of labelled highways that are good. The 2217 changesets from above are there, but there are even more. When a changeset is labelled good, it is safe to assume that all features in the changeset are good, including the highway features. Yay!
There are 50,000+ changesets labelled on osmcha. Assuming every changeset has at least one highway, as highways are among the most frequently edited features on OpenStreetMap, we could potentially have 50,000+ labelled good highways. This might be an interesting scenario to try anomaly detection models.
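To put the class imbalance in numbers, here is a quick sketch using only the counts quoted above for the latest highway classifier (2217 good, 55 harmful). The variable names are just for illustration.

```python
good = 2217    # good highways in the latest highway classifier's training set
harmful = 55   # harmful highways in the same training set

total = good + harmful
harmful_share = harmful / total
# Accuracy of a useless model that labels every highway as good.
always_good_accuracy = good / total

print(f"harmful share: {harmful_share:.1%}")                       # harmful share: 2.4%
print(f"always-good accuracy: {always_good_accuracy:.1%}")         # always-good accuracy: 97.6%
```

A model that never flags anything would already look about 97.6% accurate on this data, which is part of why plain supervised learning is hard to trust here.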
From https://en.wikipedia.org/wiki/Anomaly_detection
anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.
Another potentially big advantage of anomaly detection models is that they flag when things are different from what is expected. This means we are not limited to the types of harmful edits we have seen or given the model for training, but are in a way ready for new and unknown types of anomalies. One important thing about anomaly detection is that these models don't tell you whether a changeset is good or bad; they tell you whether it is something expected or something different.
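As a rough illustration of the idea, here is a minimal sketch of a detector that learns only from good samples and flags anything that looks different. It uses a simple per-feature z-score as a stand-in technique; it is not what Gabbar implements. The feature names (`length_km`, `tag_count`, `version`), the toy values, and the threshold of 3.0 are all made-up assumptions.

```python
from statistics import mean, stdev


def fit(good_samples):
    """Learn a per-feature (mean, standard deviation) from good samples only."""
    columns = list(zip(*good_samples))
    return [(mean(col), stdev(col)) for col in columns]


def feature_z(value, mu, sigma):
    """Absolute z-score; a feature that never varied treats any change as extreme."""
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


def anomaly_score(model, sample):
    """Largest absolute z-score across features; higher means more unusual."""
    return max(feature_z(v, mu, sigma) for v, (mu, sigma) in zip(sample, model))


def is_anomaly(model, sample, threshold=3.0):
    """True when the sample is 'something different', not necessarily harmful."""
    return anomaly_score(model, sample) > threshold


# Hypothetical, made-up feature vectors: (length_km, tag_count, version)
good_highways = [
    (0.8, 3, 2),
    (1.2, 4, 1),
    (0.5, 2, 3),
    (1.0, 3, 2),
    (0.9, 5, 1),
    (1.4, 4, 2),
]

model = fit(good_highways)
typical = (1.1, 3, 2)     # looks like the highways seen in training
unusual = (250.0, 3, 2)   # far longer than anything seen in training

print(is_anomaly(model, typical))   # False
print(is_anomaly(model, unusual))   # True
```

Note that no harmful examples are needed to fit this. The flag only says the highway is unlike the good ones seen so far, so a reviewer still has to decide whether it is actually harmful.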
cc: @anandthakker @geohacker @batpad