Look-Alike icon indicating copy to clipboard operation
Look-Alike copied to clipboard

Stack overflow for large number of rows

Open mck- opened this issue 11 years ago • 12 comments

Pure recursion as it is written now creates a stack overflow for large number of rows.

Some thoughts: If there is a way to write it in a way that allows for Tail-call elimination, we could maybe use a trampoline

Last resort would be to rewrite the tree construction in an iterative way, which I hope not to do...

mck- avatar Aug 30 '13 18:08 mck-

I'm thinking the issue might not necessarily Stack Overflow, but more a memory-leak while building the tree, because it's taking much longer than expected.. Then again, the error reads: Maximum call stack size exceeded .. I've added some tests that cause the SO error. CC @flockonus @baugarten

mck- avatar Aug 30 '13 19:08 mck-

You could use a selection algorithm to get O(n) median calculation, it would just eat up more memory... do you test the graph for cycles?

baugarten avatar Aug 30 '13 20:08 baugarten

The depth of the tree is really large... larger than it should be >6000 for 196608, when it should be ~log_2(196608)

baugarten avatar Aug 30 '13 20:08 baugarten

Median is always 0

baugarten avatar Aug 30 '13 20:08 baugarten

Are you generating a very dense set of points? I think it just happens very often that a ton of the points have the same value for a dimension, and so the splits are bad

baugarten avatar Aug 30 '13 20:08 baugarten

maybe this has to do with the standardization rounding to ints instead of floats?

Sorry for realtime debugging comments/email spam

baugarten avatar Aug 30 '13 20:08 baugarten

Very interesting thoughts, thanks Ben! So you say the Median is always 0?? I'll start by looking into that...

mck- avatar Aug 30 '13 21:08 mck-

So far:

  • Median util is correctly returning a lot of 0 due to many duplicate points
  • For many duplicate points (see stack overflow tests) the tree gets constructed as expected -- it may be arguably inefficient to allocate each object to a node, even though they are identical across all dimensions, and instead put them together in an array on 1 node -- this can get messy for querying, especially keeping in mind the filters... but may decrease the depth significantly, so might be worthwhile diving in deeper
  • The Heap seems stable, even though the stack is overflowing .. debugging with memwatch showed that the Heap size increased only 2.31mb while Real Mem usage in Activity Monitor showed an increase of roughly 77 mb (!)

screen shot 2013-09-03 at 11 51 23 am

mck- avatar Sep 03 '13 18:09 mck-

Even though RSS skyrockets, the Heap stays stable -- perhaps not that big of a deal? Save to use --stack-size=50000?

mck- avatar Sep 03 '13 21:09 mck-

I think if we keep safely under 500Mb we should have no worries =)

Very crazy of an idea and not saying we should, BUT, how do you think Golang would behave?

flockonus avatar Sep 04 '13 16:09 flockonus

In some large test-cases, building the tree shot up to 900MB of memory, due to incredible depth..

I went ahead and implemented my second point from above, which is to merge all nodes that are identical into one node -- gone with the stack overflows!:

Rows | Before | After
6k   | 134ms  | 97ms
12k  | 480ms  | 346ms
25k  | 2380ms | 1240ms
50k  | 9660ms | 3908ms    
100k | SO     | 13306ms
200k | SO     | 47921ms

Memory is also very maintained around ~70MB

mck- avatar Sep 04 '13 21:09 mck-

Some more benchmarking shows that the amount of tree that needs to be traversed explodes exponentially with more dimensions...

screen shot 2013-09-06 at 10 19 48 am

Moved query discussion to #2

mck- avatar Sep 06 '13 17:09 mck-