
Normalize Geodata Schema to avoid repeated lookups of same lat/lon

lindner opened this issue 1 year ago · 4 comments

Describe the bug The Points table stores denormalized city, country and geodata for every point; this data should be normalized.

Version 0.16.4

To Reproduce Import a large number of Points and notice that reverse geocoding takes a long time. Look at the schema and data to see a lot of duplication.

Expected behavior The schema should be adjusted to use a single geodata/city/country record for a given lat/lon. Alternatively, the ReverseGeocode job could check the other Points with the same coordinates to see whether the data has already been fetched.
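To illustrate the second option, a reverse-geocoding job could first look for an already-geocoded point at the same coordinates and copy its data instead of calling the external service again. A minimal sketch, assuming the Point model has latitude, longitude, city, country and geodata columns; the fetch_geodata helper is hypothetical and stands in for whatever external call dawarich actually makes:

# Sketch: reuse geodata from another point with identical coordinates.
def reverse_geocode(point)
  existing = Point.where(latitude: point.latitude, longitude: point.longitude)
                  .where.not(city: nil)
                  .where.not(id: point.id)
                  .first

  if existing
    # Copy the previously fetched result instead of hitting the geocoder.
    point.update(city: existing.city, country: existing.country, geodata: existing.geodata)
  else
    # fetch_geodata is hypothetical; assume it returns { city:, country:, geodata: }.
    point.update(fetch_geodata(point.latitude, point.longitude))
  end
end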

Additional context Here are some statistics from my import:

dawarich=# SELECT latitude, longitude, count(*) AS c FROM points pp GROUP BY latitude, longitude ORDER BY c DESC LIMIT 10;

 latitude  |  longitude  |   c
-----------+-------------+-------
 37.834035 | -122.196643 | 72248
 37.833984 | -122.196687 | 21906
 37.834008 | -122.196732 |  5542
 37.834109 | -122.196643 |  4701
 37.834065 | -122.196699 |  3841
 37.834071 | -122.196710 |  3734
 37.834046 | -122.196699 |  3656
 37.834082 | -122.196732 |  3174
 37.833961 | -122.196643 |  2806
 37.832880 | -122.197131 |  2768
(10 rows)

As a workaround, you can use this UPDATE SQL to populate NULL fields where the result has already been fetched for the same coordinates. It processes batches of 1000 rows, so just keep running it until it updates 0 rows.

WITH t AS (SELECT p1.id, p2.city, p2.country, p2.geodata 
                     FROM points p1 
                     INNER JOIN points p2 ON p1.latitude = p2.latitude AND p1.longitude = p2.longitude 
                             AND p1.city IS NULL AND p2.city IS NOT NULL AND p1.id <> p2.id  LIMIT 1000) 
UPDATE points SET city = t.city, country = t.country, geodata = t.geodata FROM t WHERE points.id = t.id;
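If you'd rather not paste the statement into psql repeatedly, the same workaround can be driven from the Rails console until nothing is left to update. A small sketch (the SQL is the query above; the Ruby loop is just a convenience and not part of dawarich):

sql = <<~SQL
  WITH t AS (SELECT p1.id, p2.city, p2.country, p2.geodata
             FROM points p1
             INNER JOIN points p2 ON p1.latitude = p2.latitude AND p1.longitude = p2.longitude
                     AND p1.city IS NULL AND p2.city IS NOT NULL AND p1.id <> p2.id LIMIT 1000)
  UPDATE points SET city = t.city, country = t.country, geodata = t.geodata FROM t WHERE points.id = t.id
SQL

loop do
  # exec_update returns the number of rows affected by the UPDATE.
  affected = ActiveRecord::Base.connection.exec_update(sql)
  puts "updated #{affected} rows"
  break if affected.zero?
end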

lindner avatar Nov 21 '24 16:11 lindner

A CREATE INDEX idx_points_lat_lon ON points (latitude, longitude); will speed up the query.
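If you'd rather manage it as part of the schema, the equivalent index can be added through a Rails migration. A minimal sketch (the migration class name and Rails version are illustrative, not taken from dawarich):

class AddLatLonIndexToPoints < ActiveRecord::Migration[7.1]
  def change
    # Composite index so coordinate lookups and the GROUP BY queries above can use an index scan.
    add_index :points, [:latitude, :longitude], name: "idx_points_lat_lon"
  end
end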

tabacha avatar Dec 05 '24 17:12 tabacha

I do have a case where normalizing this away would not be wanted. The geodata field contains place data such as business names, so a lookup in 2024 and one in 2030 could return different results. In that light, it's not so bad to have this duplicated. Perhaps a cache would be warranted, and the queue processing could order entries by lat/long to optimize cache usage.
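A cache along those lines could be as simple as memoizing results per coordinate pair, so that consecutive jobs for the same lat/lon skip the external call while the stored data still ages naturally between runs. A rough sketch, with the actual geocoder call passed in as a lambda (none of this is dawarich's real geocoding API):

# Memoizes reverse-geocoding results per (lat, lon) pair for the lifetime of the process.
class GeocodeCache
  def initialize(lookup)
    @lookup = lookup # e.g. ->(lat, lon) { ...call the reverse geocoder... }
    @cache = {}
  end

  def fetch(lat, lon)
    @cache[[lat, lon]] ||= @lookup.call(lat, lon)
  end
end

Ordering the queued points by latitude/longitude, as suggested above, would make consecutive jobs hit such a cache far more often.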

lindner avatar Dec 10 '24 20:12 lindner

For those importing millions of points, here's an optimization:

  1. Remove the reverse_geocoding queue in the /sidekiq interface.
  2. Start the dawarich Rails console. For me, that's the following Docker command: docker exec -it dawarich-app bin/rails console
  3. Execute the following query, which finds points with more than 100 duplicates: p = Point.find_by_sql("select * from points where id in (select max(id) as id from points where city is null group by latitude,longitude having count(*) > 100)")
  4. Add these items to the reverse geocoding queue by executing the following: p.each do |r| r.async_reverse_geocode end
  5. Wait for the queue to empty.
  6. Execute the update SQL above. (Remove the LIMIT 1000 to update everything in one shot.)
  7. Repeat with > 10 and > 1 to efficiently process the remaining entries.

Hope this helps! Maybe add this as an FAQ or add some helper to do all of the above...
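Putting steps 3–7 together, the threshold loop can also be scripted in the Rails console. A rough sketch (it assumes Point#async_reverse_geocode as used above, runs the backfill SQL via ActiveRecord, and leaves waiting for the queue to drain as a manual step):

backfill_sql = <<~SQL
  WITH t AS (SELECT p1.id, p2.city, p2.country, p2.geodata
             FROM points p1
             INNER JOIN points p2 ON p1.latitude = p2.latitude AND p1.longitude = p2.longitude
                     AND p1.city IS NULL AND p2.city IS NOT NULL AND p1.id <> p2.id)
  UPDATE points SET city = t.city, country = t.country, geodata = t.geodata FROM t WHERE points.id = t.id
SQL

[100, 10, 1].each do |threshold|
  # One representative point per (lat, lon) group that still lacks a city.
  points = Point.find_by_sql(<<~SQL)
    SELECT * FROM points WHERE id IN (
      SELECT max(id) AS id FROM points
      WHERE city IS NULL
      GROUP BY latitude, longitude
      HAVING count(*) > #{threshold}
    )
  SQL
  points.each(&:async_reverse_geocode)

  puts "Enqueued #{points.size} points for threshold > #{threshold}; wait for the queue to empty, then press Enter."
  gets

  # Backfill the remaining duplicates from the freshly geocoded points.
  ActiveRecord::Base.connection.exec_update(backfill_sql)
end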

lindner avatar Dec 10 '24 21:12 lindner

@lindner thanks for diving into the topic! This issue hasn't been missed or ignored; I'll give it more attention later, hopefully soon. Meanwhile, thank you for providing options and suggesting another approach!

Freika avatar Dec 10 '24 21:12 Freika

I guess https://github.com/Freika/dawarich/pull/699 is a first step at tackling this.

tjorim avatar Jan 21 '25 10:01 tjorim