Look-Alike
Stack overflow for large number of rows
Pure recursion as it is written now creates a stack overflow for a large number of rows.
Some thoughts: if there is a way to rewrite it so that it allows for tail-call elimination, we could maybe use a trampoline (see the sketch below).
Last resort would be to rewrite the tree construction in an iterative way, which I hope not to do...
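Not the library's actual code, just a minimal sketch of what the trampoline idea looks like, with a hypothetical `sumTo` standing in for a tail-recursive function:

```ts
// A generic trampoline: keep invoking returned thunks in a loop
// so the "recursion" never grows the call stack.
type Thunk<T> = T | (() => Thunk<T>);

function trampoline<T>(start: Thunk<T>): T {
  let result = start;
  while (typeof result === "function") {
    result = (result as () => Thunk<T>)();
  }
  return result as T;
}

// Toy tail-recursive example (hypothetical, not the library's code):
// sums 0..n without deep recursion.
function sumTo(n: number, acc = 0): Thunk<number> {
  return n === 0 ? acc : () => sumTo(n - 1, acc + n);
}

console.log(trampoline(sumTo(1_000_000)));
```

The catch: the tree build recurses into two subtrees, so it isn't tail-recursive as written; getting it into trampolinable shape would need continuation-passing style or an explicit work stack, which is already most of the way to the iterative rewrite.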
I'm thinking the issue might not necessarily be a stack overflow, but rather a memory leak while building the tree, because it's taking much longer than expected... Then again, the error reads: `Maximum call stack size exceeded`
.. I've added some tests that cause the SO error. CC @flockonus @baugarten
You could use a selection algorithm to get an O(n) median calculation; it would just eat up more memory... Do you test the graph for cycles?
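For reference, a minimal quickselect sketch (not the library's median util, just an illustration of the selection-algorithm idea): it finds the k-th smallest element in expected O(n) time, at the cost of the extra arrays built while partitioning.

```ts
// Quickselect: expected O(n) selection of the k-th smallest element (0-based).
// Partitions around a random pivot and recurses into one side only.
function quickselect(values: number[], k: number): number {
  let arr = values.slice(); // don't mutate the caller's array (extra memory)
  while (true) {
    if (arr.length === 1) return arr[0];
    const pivot = arr[Math.floor(Math.random() * arr.length)];
    const lower = arr.filter(v => v < pivot);
    const equal = arr.filter(v => v === pivot);
    const higher = arr.filter(v => v > pivot);
    if (k < lower.length) {
      arr = lower;
    } else if (k < lower.length + equal.length) {
      return pivot;
    } else {
      k -= lower.length + equal.length;
      arr = higher;
    }
  }
}

// Median of an odd-length list, e.g. for picking a split value:
const median = quickselect([5, 1, 4, 2, 3], 2); // -> 3
```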
The depth of the tree is really large... larger than it should be: >6000 for 196608 rows, when it should be ~log_2(196608) ≈ 17.6
Median is always 0
Are you generating a very dense set of points? I think it just happens very often that a ton of the points have the same value for a dimension, and so the splits are bad
Maybe this has to do with the standardization rounding to ints instead of floats?
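To make that concrete: if standardization (z-scoring) rounds to integers, most values collapse onto a few small integers, which would explain both the duplicate points and the constant 0 median. A tiny illustration with made-up numbers (not the actual standardization code):

```ts
// Hypothetical raw values for one dimension (not from the real data).
const raw = [170, 172, 168, 171, 169, 174, 167];
const mean = raw.reduce((a, b) => a + b, 0) / raw.length;
const std = Math.sqrt(raw.reduce((a, b) => a + (b - mean) ** 2, 0) / raw.length);

// As floats the standardized values are all distinct...
const zFloat = raw.map(v => (v - mean) / std);

// ...but rounded to ints they collapse onto a handful of small integers
// (mostly -1, 0, 1), creating duplicate points and dragging the median to 0.
const zInt = zFloat.map(Math.round);

console.log(zFloat); // distinct values, roughly between -1.5 and 1.8
console.log(zInt);   // many repeated values
```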
Sorry for realtime debugging comments/email spam
Very interesting thoughts, thanks Ben! So you say the Median is always 0?? I'll start by looking into that...
So far:
- The Median util is correctly returning a lot of 0s due to many duplicate points
- For many duplicate points (see the stack overflow tests) the tree gets constructed as expected -- it is arguably inefficient to allocate each object to its own node even though they are identical across all dimensions; putting them together in an array on one node instead could get messy for querying, especially keeping the filters in mind, but it might decrease the depth significantly, so it may be worthwhile diving in deeper
- The Heap seems stable, even though the stack is overflowing: debugging with memwatch showed that the Heap size increased by only 2.31 MB, while Real Mem usage in Activity Monitor showed an increase of roughly 77 MB (!)
Even though RSS skyrockets, the Heap stays stable -- perhaps not that big of a deal? Safe to use `--stack-size=50000`?
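That heap-vs-RSS comparison can also be reproduced without memwatch, using Node's built-in `process.memoryUsage()`; the `buildTree` below is only a stand-in that allocates, not the library's code:

```ts
// Compare V8 heap growth vs. resident set size (RSS) around a build step.
// buildTree here is only a stand-in that allocates; swap in the real call.
function buildTree(rows: number[][]): unknown {
  return rows.map(r => ({ point: r.slice() }));
}

const rows = Array.from({ length: 200_000 }, () => [Math.random(), Math.random()]);
const mb = (bytes: number) => (bytes / 1024 / 1024).toFixed(2) + " MB";

const maybeGc = (globalThis as any).gc;
if (typeof maybeGc === "function") maybeGc(); // node --expose-gc gives a cleaner baseline

const before = process.memoryUsage();
const tree = buildTree(rows);
const after = process.memoryUsage();

console.log("heapUsed delta:", mb(after.heapUsed - before.heapUsed));
console.log("rss delta:     ", mb(after.rss - before.rss));
```

One likely contributor to the gap: RSS counts everything the process holds (native allocations, thread stacks, memory V8 hasn't returned to the OS), and deep recursion eats stack memory that shows up in RSS but never in the V8 heap.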
I think if we keep safely under 500 MB we should have no worries =)
A very crazy idea, and not saying we should, BUT: how do you think Golang would behave?
In some large test cases, building the tree shot up to 900 MB of memory due to the incredible depth...
I went ahead and implemented my second point from above, which is to merge all identical points into one node (rough sketch after the numbers below) -- the stack overflows are gone!:
| Rows | Before | After   |
|------|--------|---------|
| 6k   | 134ms  | 97ms    |
| 12k  | 480ms  | 346ms   |
| 25k  | 2380ms | 1240ms  |
| 50k  | 9660ms | 3908ms  |
| 100k | SO     | 13306ms |
| 200k | SO     | 47921ms |
Memory is also well maintained, staying around ~70 MB
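Rough sketch of the merging idea (not the actual implementation; `TreeNode`, `dedupe`, `buildTree` and the point format are simplified stand-ins): group rows that are identical across all dimensions before splitting, so duplicates cost one leaf instead of one recursion level each, and the depth is driven by the number of *distinct* points.

```ts
type Point = number[];

interface TreeNode {
  points: Point[];     // all identical rows collapsed onto this node
  dimension?: number;
  median?: number;
  left?: TreeNode;
  right?: TreeNode;
}

// Group rows that are identical across all dimensions so they end up
// in a single node instead of one node (and one recursion level) each.
function dedupe(points: Point[]): Point[][] {
  const groups = new Map<string, Point[]>();
  for (const p of points) {
    const key = p.join(",");
    const group = groups.get(key);
    if (group) group.push(p);
    else groups.set(key, [p]);
  }
  return [...groups.values()];
}

function buildTree(points: Point[], depth = 0): TreeNode | undefined {
  const groups = dedupe(points);
  if (groups.length === 0) return undefined;
  if (groups.length === 1) return { points: groups[0] }; // all rows identical: one leaf

  const dimension = depth % points[0].length;
  // Sort the *distinct* groups and split them in half by index, so each
  // recursive call always gets strictly fewer groups than its parent.
  groups.sort((a, b) => a[0][dimension] - b[0][dimension]);
  const mid = Math.floor(groups.length / 2);
  const median = groups[mid][0][dimension];

  return {
    points: [],
    dimension,
    median,
    left: buildTree(groups.slice(0, mid).flat(), depth + 1),
    right: buildTree(groups.slice(mid).flat(), depth + 1),
  };
}
```

Re-running `dedupe` at every level is wasteful (you'd dedupe once up front in practice), and querying then has to treat every row stored on a leaf as a candidate and apply the filters per row, which is the "messy for querying" part mentioned above.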
Some more benchmarking shows that the portion of the tree that needs to be traversed explodes exponentially with more dimensions...
Moved query discussion to #2