robosat
Implement optional random feature sampling for `rs extract`
For abundant features such as buildings, we want to randomly sample OpenStreetMap features when extracting geometries in `rs extract`.
The osmium handlers in `robosat.osm` should take a sampler and, in the callback for each OpenStreetMap entity, ask the sampler whether to handle that entity or not.
For the sampler we have a few options:
- let the user pass a number `n` of samples (e.g. 20k); we take the first `n` and after that just drop features. Problem: we don't randomly sample from all geographical areas; not a good idea.
- let the user pass a fraction `f` of samples (e.g. `0.1`); in the osm callbacks we draw a random number `r` in [0, 1] and keep the sample if `r < f`. Problem: users want a fixed number of samples (e.g. 20k), but the number of samples a fraction yields depends on how many features there are in osm. For example, with parking lots a fraction of 0.1 is maybe a few thousand; with buildings it's millions.
- do two passes over the data; in the first pass count how many features there are in osm, then come up with a fraction to keep; in the second pass use the fraction approach from the previous option. Problem: needs two passes over the data, and two separate handlers for one feature.
- use an online algorithm for random sampling: reservoir sampling. It's an algorithm for randomly sampling `k` items out of a stream of unknown size. This is a good read.
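To make the problem with the fraction option concrete, here is a minimal sketch. The feature counts are made up for illustration and `sample_by_fraction` is a hypothetical helper, not part of robosat.

```python
import random


def sample_by_fraction(stream, fraction, rng):
    """Keep each item independently with probability `fraction`."""
    return [item for item in stream if rng.random() < fraction]


rng = random.Random(0)  # seeded only so the sketch is reproducible

# Hypothetical feature counts; real OpenStreetMap counts depend on region and tag.
parking_lots = range(20_000)
buildings = range(2_000_000)

few = sample_by_fraction(parking_lots, 0.1, rng)
many = sample_by_fraction(buildings, 0.1, rng)

# Same fraction, but the sample sizes scale with the size of the input.
print(len(few), len(many))
```

With the same `f`, the number of kept samples is roughly `f` times the number of features, so the user cannot ask for a fixed sample size without knowing the feature count up front.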
Tasks:
- [ ] Implement a `ReservoirSampler` class; it takes a size `n`, the max. number of items to randomly sample from a stream of unknown size.
- [ ] Let our osmium handlers take a `ReservoirSampler`; in the osm entity callbacks they push features into the reservoir, and in the save function they save features from the reservoir. The reservoir is responsible for keeping or discarding features, i.e. doing the sampling.
- [ ] Add an optional argument to the `rs extract` tool for users to set the sample size; pass this argument to the sampler.
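The tasks above could be sketched roughly as follows. This is only an illustration under assumptions: the method names (`push`, `save`), the `--samples` flag, and `HandlerSketch` are hypothetical, and the handler is a plain class standing in for a real osmium handler so that the example runs without pyosmium.

```python
import argparse
import random


class ReservoirSampler:
    """Keeps a uniform random sample of at most `capacity` items from a stream."""

    def __init__(self, capacity, rng=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.reservoir = []
        self.pushed = 0  # number of items seen so far
        self.rng = rng or random.Random()

    def push(self, item):
        self.pushed += 1
        if len(self.reservoir) < self.capacity:
            # Still filling up: keep every item.
            self.reservoir.append(item)
            return
        # Algorithm R: the i-th item lands in the reservoir with probability
        # capacity / i, replacing a uniformly chosen slot.
        slot = self.rng.randrange(self.pushed)
        if slot < self.capacity:
            self.reservoir[slot] = item

    def __iter__(self):
        return iter(self.reservoir)

    def __len__(self):
        return len(self.reservoir)


class HandlerSketch:
    """Hypothetical stand-in for an osmium handler in robosat.osm."""

    def __init__(self, sampler):
        self.sampler = sampler

    def way(self, w):
        # The entity callback only pushes; the sampler decides what survives.
        self.sampler.push(w)

    def save(self):
        # A real handler would write the features out; here we just return them.
        return list(self.sampler)


# Hypothetical CLI flag; the actual argument name is up for discussion.
parser = argparse.ArgumentParser(prog="rs extract")
parser.add_argument("--samples", type=int, default=20000, help="max. number of features to keep")
args = parser.parse_args(["--samples", "1000"])

handler = HandlerSketch(ReservoirSampler(args.samples))
for entity in range(50_000):  # stands in for the stream of OpenStreetMap entities
    handler.way(entity)

features = handler.save()
print(len(features))
```

Memory stays bounded by the sample size, and a single pass over the data is enough, since the sampler never needs to know the total number of features.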
Note: now that we have the `rs dedupe` tool deduplicating detections against OpenStreetMap, we need to think about how to design the interface here. The dedupe tool currently reads in the OpenStreetMap features created by the extract tool. If we randomly sample features in extract, its output no longer contains all features, so we can no longer use it for deduplication.