OfflineReverseGeocode Very skewed distribution of duration on large datasets

Very skewed distribution of duration on large datasets

Open franc00018 opened this issue 9 years ago • 4 comments

Hi, I'm trying to use your code as part of an algorithm on a large dataset using Scala and Apache Spark. I'm having great results in terms of accuracy but I did if on several samples of GPS tracklog data and have a very skewed distribution of duration

Metric	Min	25th percentile	Median	75th percentile	Max
Duration	10 s	1.2 min	6.2 min	12 min	51 min
GC Time	0.2 s	2 s	4 s	4 s	10 s
Input	25.5 MB	128.1 MB	128.1 MB	128.1 MB	128.1 MB
Output	18.4 MB	93.3 MB	93.6 MB	93.7 MB	94.0 MB

I would like to know it this is an expected behaviour of this algorithm or if you have some tips and tricks to have a more stable results for any dataset (having less variance, maybe at the cost of having an higher average duration.

Sep 21 '15 17:09 franc00018

The average complexity of the kd-tree, O(logn), should be what we see since it's a large dataset (barring some bug). It would be good to narrow down further (halve the datasets, take the slowest half and repeat). Perhaps there's a set of lookups that cause a particular issue.

I suspect this could be an old fashion memory caching issue. One thing i would try is to sort the datasets by location to some degree first. Points near each other in the real world will also traverse the same parts of the KD-Tree.

Sep 21 '15 23:09 AReallyGoodName

I have made the observation that if the points I search for is within the area covered by the points in K-D tree everything goes smoothly but when I search for nearest from a set of points well outside the covered area the search is significantly slower.

Example Number of points in KDtree: 749063 Time for finding 1000 nearest when points are well outside K-d tree area: 29 seconds Time for finding 1000 nearest when points are in the middle of the K-d tree area: less than a second

Sep 22 '15 08:09 hoerup

@franc00018 - did you face any serialisation issues while using it with spark ?

Jan 18 '17 09:01 kutt4n

Not of what I remember. My code completed successfully. It was on Spark 1.2.1. I unfortunately don't have access to the data anymore so I can't easily reproduce.

Jan 24 '17 04:01 franc00018

OfflineReverseGeocode OfflineReverseGeocode copied to clipboard

Very skewed distribution of duration on large datasets

OfflineReverseGeocode
OfflineReverseGeocode copied to clipboard