Split multipolygons in source dataset into separate polygon files with separate bounding boxes to improve perf
Hi @zverok, really appreciate the work you've done with this gem and wanted to say thank you. I hope you are safe in these crazy times 🙏 💙 💛
I've done some work to improve performance of the multipolygon data that some timezones contain. By splitting the geojson features into single polygons the search algorithm you created can be much faster (2x or so). The 2020d data I believe had more multipolygon data, which meant it slowed things down. There is a slight increase in parsing 3.5x the number of files to find the matching bounding boxes, but this is offset by the reduction in cost from having to do the contains_point? calcs on many more polygons than actually needed.
Running benchmarking on Ruby 3 with Benchmarkbmbm to minimise any impact of garbage collection
WhereTZ 0.0.5
Benchmark.bmbm do |x|
x.report("lookup!") { (0..1000).each { begin; WhereTZ.lookup(rand(-90..90), rand(-180..180)) rescue ArgumentError; end } }
end
Rehearsal -------------------------------------------
lookup! 9.186424 0.246790 9.433214 ( 9.641439)
---------------------------------- total: 9.433214sec
user system total real
lookup! 8.874266 0.200417 9.074683 ( 9.123046)
=> [#<Benchmark::Tms:0x0000000104add568 @label="lookup!", @real=9.123046000022441, @cstime=0.0, @cutime=0.0, @stime=0.2004170000000003, @utime=8.874265999999999, @total=9.074682999999999>]
WhereTZ 0.0.6
Benchmark.bmbm do |x|
x.report("lookup!") { (0..1000).each { begin; WhereTZ.lookup(rand(-90..90), rand(-180..180)) rescue ArgumentError; end } }
end
Rehearsal -------------------------------------------
lookup! 12.480392 0.325768 12.806160 ( 12.917325)
--------------------------------- total: 12.806160sec
user system total real
lookup! 12.337994 0.275937 12.613931 ( 12.673767)
=> [#<Benchmark::Tms:0x000000010b556958 @label="lookup!", @real=12.673766999971122, @cstime=0.0, @cutime=0.0, @stime=0.2759369999999999, @utime=12.337993999999998, @total=12.613930999999997>]
WhereTZ NikoRoberts fork with oceans
Benchmark.bmbm do |x|
x.report("lookup!") { (0..1000).each { begin; WhereTZ.lookup(rand(-90..90), rand(-180..180), include_oceans: true) rescue ArgumentError; end } }
end
Rehearsal -------------------------------------------
lookup! 7.063934 0.175196 7.239130 ( 7.415430)
---------------------------------- total: 7.239130sec
user system total real
lookup! 6.165460 0.109725 6.275185 ( 6.336776)
=> [#<Benchmark::Tms:0x000000012415d0b8 @label="lookup!", @real=6.33677599998191, @cstime=0.0, @cutime=0.0, @stime=0.10972500000000007, @utime=6.165460000000024, @total=6.275185000000024>]
WhereTZ NikoRoberts fork without oceans
Benchmark.bmbm do |x|
x.report("lookup!") { (0..1000).each { begin; WhereTZ.lookup(rand(-90..90), rand(-180..180)) rescue ArgumentError; end } }
end
Rehearsal -------------------------------------------
lookup! 5.235642 0.137277 5.372919 ( 5.437075)
---------------------------------- total: 5.372919sec
user system total real
lookup! 4.379040 0.109574 4.488614 ( 4.522984)
=> [#<Benchmark::Tms:0x000000011987d060 @label="lookup!", @real=4.522983999922872, @cstime=0.0, @cutime=0.0, @stime=0.10957400000000006, @utime=4.37904, @total=4.488614>]
@zverok seems like a good pr to me. @NikoRoberts are you running this in prod?
I am interested in seeing this update as well :+1:
Did a slightly different benchmark using predictable coordinates instead of random. Nice performance improvement!
iterations = 1000
Benchmark.bmbm do |x|
x.report("lookup!") do
iterations.times do |i|
lat = (75 - (150.0/i))
lng = (150 - (300.0/i))
WhereTZ.lookup(lat, lng) rescue ArgumentError
lat = (-75 + (150.0/i))
lng = (150 - (300.0/i))
WhereTZ.lookup(lat, lng) rescue ArgumentError
lat = (75 - (150.0/i))
lng = (-150 + (300.0/i))
WhereTZ.lookup(lat, lng) rescue ArgumentError
end
end
end
v0.0.5
Rehearsal -------------------------------------------
lookup! 12.000930 0.208663 12.209593 ( 12.210856)
--------------------------------- total: 12.209593sec
user system total real
lookup! 12.000062 0.123985 12.124047 ( 12.125724)
v0.0.6
Rehearsal -------------------------------------------
lookup! 12.132687 0.148525 12.281212 ( 12.294039)
--------------------------------- total: 12.281212sec
user system total real
lookup! 12.349960 0.115950 12.465910 ( 12.472220)
NikoRoberts:master
Rehearsal -------------------------------------------
lookup! 2.925792 0.032123 2.957915 ( 2.959522)
---------------------------------- total: 2.957915sec
user system total real
lookup! 3.007141 0.032116 3.039257 ( 3.043532)
Unfortunately, I should aknowledge that I have no resources currently to support this gem. (Proper support will require at least updating at every timezone-boundary-builder data update, which haven't been done for quite some time already.)
I am not using it myself currently, so it is very hard for me to estimate the consequences of this PR properly (whether "skipping the oceans in most cases" is sensible enough, whether having 1400 files is acceptable etc.)
In addition, I currently believe that the approach with "read from many JSON files" is childishly naive, there probably should be better ways. Say, TimezoneFinder (inherited from Python and using singular binary data file of 19 MB) performs this way on my machine, compared with the current WhereTZ (using the same benchmark code @HarlemSquirrel provided above):
user system total real
WhereTZ 0.252163 0.007784 0.259947 ( 0.260101)
TimezoneFinder 0.010709 0.000162 0.010871 ( 0.010882)
So... Sorry to disappoint you, and thanks for all the hard work, but that's all I can say as of now :shrug:
Timezonefinder also removes oceans and Antarctica timezones.
There are "shortcuts" in the binary data, which I guess are manually maintained bounding boxes around areas in the same time zone or something along those lines. The same approach could be used with the file approach.
If we added simplified bounding boxes that were searched first, we could probably get similar performance for the vast majority of coordinates.
I threw together a gem that wraps a Rust implementation that seems to be quite a bit faster: https://github.com/HarlemSquirrel/tzf-rb
https://gist.github.com/HarlemSquirrel/701f83783486794001a75cb8cf5576ba Benchmarked with WhereTZ v0.0.6
Rehearsal --------------------------------------------------
WhereTZ.lookup 13.960516 0.140053 14.100569 ( 14.102781)
TZF.tz_name 0.044170 0.019952 0.064122 ( 0.064249)
---------------------------------------- total: 14.164691sec
user system total real
WhereTZ.lookup 13.182494 0.163981 13.346475 ( 13.348239)
TZF.tz_name 0.008186 0.000002 0.008188 ( 0.008196)