bug: reschedule has much larger memory footprint than expected
Describe the bug
I've encountered several OOMs when running risectl scale vertical with many fragments included.
For example, risectl scale vertical --fragments 1,2,3,4,5,6,7,8 can result in a compute node OOM and a subsequent reschedule failure, while risectl scale vertical --fragments 1,2,3,4 followed by risectl scale vertical --fragments 5,6,7,8 can succeed.
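As a stopgap, a large reschedule can be split into smaller batches so that the transient memory of each call stays bounded. Below is a minimal shell sketch of that idea; the fragment IDs and batch size are placeholders, and it assumes risectl is on the PATH.

```bash
# Sketch: issue one reschedule per small batch of fragments instead of a
# single large reschedule. Fragment IDs and batch size are examples only.
fragments=(1 2 3 4 5 6 7 8)
batch_size=4
for ((i = 0; i < ${#fragments[@]}; i += batch_size)); do
  # Join the current slice of fragment IDs with commas.
  batch=$(IFS=,; echo "${fragments[*]:i:batch_size}")
  risectl scale vertical --fragments "$batch"
done
```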
I think the only additional memory required is for the new actors that have not started yet, which should be small. And since the source is paused during rescheduling, the memory footprint of the old actors should remain stable. Also, in my case there are no source executors involved in the rescheduling.
Here are some numbers from my case: a compute node uses 100 GB of memory before rescheduling; during rescheduling it increases to 200 GB until it OOMs.
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
No response
Additional context
No response
> I think the only additional memory required is for the new actors that have not started yet, which should be small. And since the source is paused during rescheduling, the memory footprint of the old actors should remain stable. Also, in my case there are no source executors involved in the rescheduling.

Agree. The scaling process only changes actors & metadata and does not involve data, so it's weird that the memory usage doubled.
Thinking about how to reproduce this... How about running a scaling operation while a longevity test is running?
Marked as priority/high because we have encountered this twice in the past week.
It appears that I cannot easily reproduce it on my machine. I attempted to scale out a nexmark_q7 job from parallelism 2 to 20, involving 90 new actors and 1,200 local exchange channels, and found only ~20 MB of memory usage.
Approximate breakdown of that usage:
- ~2 MB for channels
- ~2 MB for actor updates
- ~2 MB for tracing spans
- ~2 MB for vnode mappings
- ~10 MB for executor streams
With remote exchange, tonic decoding and prost encoding may occupy more memory, but the numbers still look reasonable and the allocations are mostly persistent: scaling out a nexmark_q7 job from 2x3 parallelism to 20x3, involving 270 new actors and ~11,000 exchange channels, showed ~400 MB of memory usage.
Since I didn't find a large memory footprint, and most of the allocations in my local reproduction are persistent, I'm wondering:
- whether there are too many fragments or the parallelism is very large in your case,
- or whether the query pattern matters, for example, a bug when some executor initializes.
I haven't had the chance to get heap files from the original case. Will get one next time.
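For reference, here is a minimal sketch of how a heap profile could be captured with jemalloc's built-in profiler, assuming the compute node is linked against a jemalloc build with profiling enabled; the exact environment variable name (MALLOC_CONF or a prefixed variant) and the dump paths are assumptions that depend on how jemalloc is vendored.

```bash
# Sketch: dump a heap profile roughly every 4 GiB of cumulative allocation
# (lg_prof_interval:32 means 2^32 bytes). The path prefix is an example.
export MALLOC_CONF="prof:true,prof_prefix:/tmp/compute-node.heap,lg_prof_interval:32"
# ...start the compute node as usual, then inspect the dumps, e.g. with jeprof:
jeprof --svg /path/to/compute-node /tmp/compute-node.heap.*.heap > heap.svg
```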
Finally we have a heap file. Of the 150 GB memory footprint of the compute node, 30 GB is storage cache, which is expected. The other two parts, which add up to 100 GB, are unexpected. The source is paused. @shanicky The RW version is 1.6.
related to https://github.com/risingwavelabs/risingwave/issues/14533