
WIP: Custom quest to import missing locations from ATP in the Czech Republic

Open vfosnar opened this issue 1 year ago • 12 comments

Right now I'm finishing the backend; once I'm sure things are stable, this can be merged. The included image is public domain (https://www.svgrepo.com/svg/491993/electricity-bill).

One thing I'm a bit stuck on, though it isn't a showstopper, is that fun onSyncedEdit(edit: ElementEdit, id: String?) never gets called. So if the user doesn't have an internet connection at the time of the edit, it never gets synced to my backend, which otherwise only updates this data once per day.

vfosnar avatar Mar 16 '24 23:03 vfosnar

@matkoniecz you are working with ATP stuff, so I thought you might be interested in this quest?

Helium314 avatar Mar 17 '24 19:03 Helium314

Yeah, I am implementing exactly this right now :/

matkoniecz avatar Mar 17 '24 20:03 matkoniecz

Right now I'm finishing the backend

Is it open source, or will it be?

matkoniecz avatar Mar 17 '24 20:03 matkoniecz

missing locations from ATP

How does it detect features that are present in ATP but missing from OSM? Is it skipping low-quality spiders?

in the Czech Republic

What is the reason for such a limitation? Server costs?

matkoniecz avatar Mar 17 '24 20:03 matkoniecz

How can the user change the location compared to what ATP reports? Note that in basically all cases the location reported in ATP is not good enough for OSM purposes, with mismatches ranging from a few meters to offsets of 20 m or 40 m being normal.

(There are also objects offset much more, but at that point it is also getting into "ATP claims it exists, but it does not exist" cases, where something is offset by 2 km, 200 km or 2000 km.)

matkoniecz avatar Mar 17 '24 20:03 matkoniecz

Oh cool!

First things first: the primary target for this project was to update already existing elements, but I realized at least half of the entries in the Czech Republic are missing.

At this point it's a bunch of Python scripts bodged together, but I'm slowly cleaning it up.

Is it open source, or will it be?

yes, it is @ https://gitlab.com/vfosnar/atpsync and https://gitlab.com/vfosnar/atpsync_backend

How does it detect features that are present in ATP but missing from OSM?

For finding already matched elements, it checks whether both brand:wikidata and ref match, or whether the ref:atp:<spider name> tag and value match.
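
As a minimal sketch of that rule (the function and field names are illustrative, not atpsync's actual API):

```python
def is_already_matched(osm_tags: dict, atp_ref: str, atp_wikidata: str,
                       spider: str) -> bool:
    """An OSM element counts as matched when both brand:wikidata and ref
    agree with the ATP item, or when its ref:atp:<spider> tag carries the
    ATP ref. The None checks avoid "matching" on mutually absent tags."""
    if atp_ref is not None and osm_tags.get("ref:atp:" + spider) == atp_ref:
        return True
    return (atp_wikidata is not None and atp_ref is not None
            and osm_tags.get("brand:wikidata") == atp_wikidata
            and osm_tags.get("ref") == atp_ref)
```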

For finding previously unmatched ones, it searches within a radius of 100 meters. If it finds a match, for example within 20 meters, it checks double that distance, in this case 40 m, to rule out possible duplicates/collisions. These are not common, but they happen and need to be resolved manually (for example, when both a node and a way carry the same tags).
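
A sketch of that doubled-radius check, under the assumption that candidates are simple dicts with a pos field holding (lat, lon):

```python
from math import cos, radians, sqrt

def distance_m(a, b):
    """Approximate planar distance in meters between (lat, lon) pairs;
    good enough at the ~100 m scale used here."""
    dlat = (a[0] - b[0]) * 111_320
    dlon = (a[1] - b[1]) * 111_320 * cos(radians(a[0]))
    return sqrt(dlat * dlat + dlon * dlon)

def match_unreferenced(atp_pos, candidates, search_radius_m=100):
    """Nearest-candidate search with the collision check described above:
    if the best match is at distance d, any second candidate within 2*d
    makes the case ambiguous and it goes to manual resolution."""
    in_range = [c for c in candidates
                if distance_m(atp_pos, c["pos"]) <= search_radius_m]
    if not in_range:
        return None  # nothing nearby: candidate for a "missing place" quest
    in_range.sort(key=lambda c: distance_m(atp_pos, c["pos"]))
    best = in_range[0]
    d = distance_m(atp_pos, best["pos"])
    rivals = [c for c in candidates
              if c is not best and distance_m(atp_pos, c["pos"]) <= 2 * d]
    if rivals:
        return "ambiguous"  # e.g. a node and a way carrying the same tags
    return best
```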

Is it skipping low-quality spiders?

I hand-picked some spiders, as there are collisions between them and a lot of the data simply can't be included in quests; Tesco, for example, only has city-level precision. Such data only makes sense to monitor.

What is the reason for such a limitation? Server costs?

I wanted to start small: I know the Czech Republic better than the rest of the world and I'm more aware of local conventions. There is no technical reason.

How can the user change the location compared to what ATP reports?

When creating/editing an element in SCEE they can move the node wherever they want; after that, the server will match it based on its ref:atp:<spider name>, regardless of where it's located.

I'm open to collaboration, but:

  • Targeting the whole world from the start doesn't feel right.
  • There is a lot of invalid data. I had to modify basically every scraper I'm using to actually scrape valid information.

What is your take on this, @matkoniecz ?

vfosnar avatar Mar 17 '24 22:03 vfosnar

btw I have an (outdated) map (based on an outdated source, i.e. ref:atp:<spider name> -> ref) if you want to see where I'm currently at: https://atpsync.vfosnar.cz/

vfosnar avatar Mar 17 '24 22:03 vfosnar

Targeting the whole world from the start doesn't feel right.

My plan was to target my own country at the start, to allow testing the quality of what is being suggested.

But with a design that would allow processing the worldwide dataset in the future.

There is a lot of invalid data. I had to modify basically every scraper I'm using to actually scrape valid information.

My plan for that was to import only the shop name, brand and type, and to ignore all other tags, as I worried about this.
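
Something like this whitelist filter would express that plan (the exact key set here is my assumption, not a decided list):

```python
# Keep only "safe" top-level tags from an ATP item and drop everything else
# (opening_hours, phone, website, ...) until upstream data quality allows more.
SAFE_KEYS = {"name", "brand", "brand:wikidata", "shop", "amenity"}

def strip_to_safe_tags(atp_tags: dict) -> dict:
    return {k: v for k, v in atp_tags.items() if k in SAFE_KEYS}
```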

For example, I would consider https://github.com/alltheplaces/alltheplaces/issues/6943 to be a prerequisite for using any opening_hours tags from ATP.

Though even top-level tags often need adjustments (I have opened https://github.com/alltheplaces/alltheplaces/pull/7344, https://github.com/alltheplaces/alltheplaces/pull/7572 and https://github.com/alltheplaces/alltheplaces/pull/6763 so far, and reported more cases; see e.g. https://github.com/alltheplaces/alltheplaces/issues/7600).

matkoniecz avatar Mar 17 '24 22:03 matkoniecz

I was just starting to think about how to design things so that performance/costs will still allow worldwide processing (now and in the future) without strain. So far I was just reminding myself about some data structures (that I used a long time ago).

Are you doing anything smart here with matching OSM data and ATP data? Or maybe I am overthinking it and brute force scales well enough, at least to the country level? Though maybe it will stop working once more than a few spiders and more than one country are enabled.

After all, brute force worked well for my matcher of a single spider across Europe, but running something like that 2000 times does not sound like a good idea to me. Though maybe comparing each spider with the Overpass query output for it is not such a bad idea after all?
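
For what it's worth, a plain grid index is one of those data structures that usually carries matching to country or even planet scale; a minimal sketch, assuming illustrative element dicts with a pos field:

```python
from collections import defaultdict
from math import floor

CELL_DEG = 0.01  # roughly 1 km cells at Czech latitudes; size is an assumption

def build_grid(osm_elements):
    """Bucket OSM elements into lat/lon grid cells once; each ATP item then
    only has to be compared against its own cell and the 8 neighbours,
    turning a brute-force O(n*m) scan into roughly O(n + m)."""
    grid = defaultdict(list)
    for el in osm_elements:
        lat, lon = el["pos"]
        grid[(floor(lat / CELL_DEG), floor(lon / CELL_DEG))].append(el)
    return grid

def nearby(grid, pos):
    """Everything in the cell containing pos plus its 8 neighbours."""
    ci, cj = floor(pos[0] / CELL_DEG), floor(pos[1] / CELL_DEG)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            yield from grid.get((ci + di, cj + dj), [])
```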

In general I would really prefer to cooperate on an existing project rather than start a separate brand-new one! I will look around the code to see what you did and maybe I will send some PRs. BTW, I would consider having the readme (also?) in English.

And at least I can try playing with it to judge how well this works in action as a quest in StreetComplete.

matkoniecz avatar Mar 17 '24 23:03 matkoniecz

Are you doing anything smart here with matching OSM data and ATP data?

Not much... Python is unsurprisingly the largest bottleneck right now. I'm doing an Overpass query for each spider; for example, for KFC it will search for fast food with "KFC" or "Kentucky Fried Chicken" in the name. I also considered searching by brand, but I was not able to write such an Overpass query.
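
Roughly this kind of per-spider query, as a sketch (the exact queries in atpsync may differ):

```python
import requests

# Fast food in Czechia whose name mentions KFC / Kentucky Fried Chicken,
# matched case-insensitively via an Overpass regex filter.
QUERY = """
[out:json][timeout:60];
area["ISO3166-1"="CZ"]["admin_level"="2"]->.cz;
nwr["amenity"="fast_food"]["name"~"kfc|kentucky fried chicken",i](area.cz);
out center;
"""

resp = requests.post("https://overpass-api.de/api/interpreter",
                     data={"data": QUERY}, timeout=90)
resp.raise_for_status()
elements = resp.json()["elements"]
```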

will send some PRs.

Treat this more as a POC project. If I wanted to scale, I'd rather pick a better-suited language like Rust or Go. The code is already really slow, and it's worth considering a rewrite.

BTW, I would consider having the readme (also?) in English.

Right now it's just a useless summary of a specific part of the code anyway :)

Can you write to me on Matrix so we don't spam here? @me:vfosnar.cz

vfosnar avatar Mar 17 '24 23:03 vfosnar

Or maybe it's better to keep using a simpler language on the backend and use PostGIS to do the OSM lookups locally. That allows optimizing for specific queries and doesn't overload the Overpass servers.
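
A sketch of what such a local lookup could look like; the table and column names follow a typical osm2pgsql import and are assumptions here:

```python
import psycopg  # psycopg 3; any Postgres driver would do

# Candidate fast food nodes within 100 m of an ATP location, queried
# straight from a local PostGIS database instead of Overpass.
SQL = """
SELECT osm_id, name
FROM planet_osm_point
WHERE amenity = 'fast_food'
  AND ST_DWithin(
        ST_Transform(way, 4326)::geography,
        ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography,
        100)  -- radius in meters thanks to the geography cast
"""

with psycopg.connect("dbname=osm") as conn:
    rows = conn.execute(SQL, {"lon": 14.42, "lat": 50.09}).fetchall()
```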

vfosnar avatar Mar 17 '24 23:03 vfosnar

Can you write to me on Matrix so we don't spam here? @me:vfosnar.cz

I posted there

matkoniecz avatar Mar 18 '24 09:03 matkoniecz

@vfosnar so you gave up on this? https://atpsync.vfosnar.cz/ also shows nothing for me.

Helium314 avatar Aug 08 '24 19:08 Helium314

I'm still continuing development and want to finish this 100%. I rethought the system so it can be used in a more general way with more datasets. I want to support not only creating new nodes in SCEE but also manual verification of updates and deletions for some datasets. I will prioritize node creation, as I can reuse the existing UI.

I hope I will finish it, including the SCEE integration, before the summer holidays end. The problem is that we are currently moving, I have a summer job, and I have to prepare for final exams :)

I will start work on SCEE by the end of next week if things go as planned.

For atpsync I have lots of changes locally because of major rewrites and will be publishing them here.

The demo is dead because there is not a lot of space on my Oracle instance and the app is outdated anyway. I've been given access to a school VPS, so that should solve the problem.

vfosnar avatar Aug 08 '24 20:08 vfosnar