wheretz Split multipolygons in source dataset into separate polygon files with separate bounding boxes to improve perf

Hi @zverok, really appreciate the work you've done with this gem and wanted to say thank you. I hope you are safe in these crazy times 🙏 💙 💛

I've done some work to improve performance of the multipolygon data that some timezones contain. By splitting the geojson features into single polygons the search algorithm you created can be much faster (2x or so). The 2020d data I believe had more multipolygon data, which meant it slowed things down. There is a slight increase in parsing 3.5x the number of files to find the matching bounding boxes, but this is offset by the reduction in cost from having to do the contains_point? calcs on many more polygons than actually needed.

Running benchmarking on Ruby 3 with Benchmarkbmbm to minimise any impact of garbage collection

WhereTZ 0.0.5

Benchmark.bmbm do |x|
    x.report("lookup!") { (0..1000).each { begin; WhereTZ.lookup(rand(-90..90), rand(-180..180)) rescue ArgumentError; end } }
end

Rehearsal -------------------------------------------
lookup!   9.186424   0.246790   9.433214 (  9.641439)
---------------------------------- total: 9.433214sec

              user     system      total        real
lookup!   8.874266   0.200417   9.074683 (  9.123046)
  => [#<Benchmark::Tms:0x0000000104add568 @label="lookup!", @real=9.123046000022441, @cstime=0.0, @cutime=0.0, @stime=0.2004170000000003, @utime=8.874265999999999, @total=9.074682999999999>]

WhereTZ 0.0.6

  Benchmark.bmbm do |x|
      x.report("lookup!") { (0..1000).each { begin; WhereTZ.lookup(rand(-90..90), rand(-180..180)) rescue ArgumentError; end } }
  end
  
Rehearsal -------------------------------------------
lookup!  12.480392   0.325768  12.806160 ( 12.917325)
--------------------------------- total: 12.806160sec

              user     system      total        real
lookup!  12.337994   0.275937  12.613931 ( 12.673767)
 => [#<Benchmark::Tms:0x000000010b556958 @label="lookup!", @real=12.673766999971122, @cstime=0.0, @cutime=0.0, @stime=0.2759369999999999, @utime=12.337993999999998, @total=12.613930999999997>]

WhereTZ NikoRoberts fork with oceans

Benchmark.bmbm do |x|
  x.report("lookup!") { (0..1000).each { begin; WhereTZ.lookup(rand(-90..90), rand(-180..180), include_oceans: true) rescue ArgumentError; end } }
end

Rehearsal -------------------------------------------
lookup!   7.063934   0.175196   7.239130 (  7.415430)
---------------------------------- total: 7.239130sec

              user     system      total        real
lookup!   6.165460   0.109725   6.275185 (  6.336776)
 => [#<Benchmark::Tms:0x000000012415d0b8 @label="lookup!", @real=6.33677599998191, @cstime=0.0, @cutime=0.0, @stime=0.10972500000000007, @utime=6.165460000000024, @total=6.275185000000024>]

WhereTZ NikoRoberts fork without oceans

Benchmark.bmbm do |x|
  x.report("lookup!") { (0..1000).each { begin; WhereTZ.lookup(rand(-90..90), rand(-180..180)) rescue ArgumentError; end } }
end

Rehearsal -------------------------------------------
lookup!   5.235642   0.137277   5.372919 (  5.437075)
---------------------------------- total: 5.372919sec

              user     system      total        real
lookup!   4.379040   0.109574   4.488614 (  4.522984)
 => [#<Benchmark::Tms:0x000000011987d060 @label="lookup!", @real=4.522983999922872, @cstime=0.0, @cutime=0.0, @stime=0.10957400000000006, @utime=4.37904, @total=4.488614>]

Mar 08 '22 04:03 NikoRoberts

@zverok seems like a good pr to me. @NikoRoberts are you running this in prod?

Jul 14 '22 19:07 bf4

I am interested in seeing this update as well :+1:

Feb 01 '23 21:02 HarlemSquirrel

Did a slightly different benchmark using predictable coordinates instead of random. Nice performance improvement!

iterations = 1000
Benchmark.bmbm do |x|
  x.report("lookup!") do
    iterations.times do |i|
        lat = (75 - (150.0/i))
        lng = (150 - (300.0/i))
        WhereTZ.lookup(lat, lng) rescue ArgumentError

        lat = (-75 + (150.0/i))
        lng = (150 - (300.0/i))
        WhereTZ.lookup(lat, lng) rescue ArgumentError

        lat = (75 - (150.0/i))
        lng = (-150 + (300.0/i))
        WhereTZ.lookup(lat, lng) rescue ArgumentError
    end
  end
end

v0.0.5
Rehearsal -------------------------------------------
lookup!  12.000930   0.208663  12.209593 ( 12.210856)
--------------------------------- total: 12.209593sec

              user     system      total        real
lookup!  12.000062   0.123985  12.124047 ( 12.125724)

v0.0.6
Rehearsal -------------------------------------------
lookup!  12.132687   0.148525  12.281212 ( 12.294039)
--------------------------------- total: 12.281212sec

              user     system      total        real
lookup!  12.349960   0.115950  12.465910 ( 12.472220)

NikoRoberts:master
Rehearsal -------------------------------------------
lookup!   2.925792   0.032123   2.957915 (  2.959522)
---------------------------------- total: 2.957915sec

              user     system      total        real
lookup!   3.007141   0.032116   3.039257 (  3.043532)

Feb 14 '23 16:02 HarlemSquirrel

Unfortunately, I should aknowledge that I have no resources currently to support this gem. (Proper support will require at least updating at every timezone-boundary-builder data update, which haven't been done for quite some time already.)

I am not using it myself currently, so it is very hard for me to estimate the consequences of this PR properly (whether "skipping the oceans in most cases" is sensible enough, whether having 1400 files is acceptable etc.)

In addition, I currently believe that the approach with "read from many JSON files" is childishly naive, there probably should be better ways. Say, TimezoneFinder (inherited from Python and using singular binary data file of 19 MB) performs this way on my machine, compared with the current WhereTZ (using the same benchmark code @HarlemSquirrel provided above):

                     user     system      total        real
WhereTZ          0.252163   0.007784   0.259947 (  0.260101)
TimezoneFinder   0.010709   0.000162   0.010871 (  0.010882)

So... Sorry to disappoint you, and thanks for all the hard work, but that's all I can say as of now :shrug:

Feb 19 '23 20:02 zverok

Timezonefinder also removes oceans and Antarctica timezones.

There are "shortcuts" in the binary data, which I guess are manually maintained bounding boxes around areas in the same time zone or something along those lines. The same approach could be used with the file approach.

If we added simplified bounding boxes that were searched first, we could probably get similar performance for the vast majority of coordinates.

Feb 19 '23 22:02 NikoRoberts

I threw together a gem that wraps a Rust implementation that seems to be quite a bit faster: https://github.com/HarlemSquirrel/tzf-rb

https://gist.github.com/HarlemSquirrel/701f83783486794001a75cb8cf5576ba Benchmarked with WhereTZ v0.0.6

Rehearsal --------------------------------------------------
WhereTZ.lookup  13.960516   0.140053  14.100569 ( 14.102781)
TZF.tz_name      0.044170   0.019952   0.064122 (  0.064249)
---------------------------------------- total: 14.164691sec

                     user     system      total        real
WhereTZ.lookup  13.182494   0.163981  13.346475 ( 13.348239)
TZF.tz_name      0.008186   0.000002   0.008188 (  0.008196)

Feb 21 '23 22:02 HarlemSquirrel