
bug: reschedule has much larger memory footprint than expected

Open · zwang28 opened this issue 1 year ago • 8 comments

Describe the bug

I've encountered several OOMs when running risectl scale vertical with lots of fragments included. For example, risectl scale vertical --fragments 1,2,3,4,5,6,7,8 can result in a compute node OOM and a subsequent reschedule failure. However, risectl scale vertical --fragments 1,2,3,4 followed by risectl scale vertical --fragments 5,6,7,8 can succeed.
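To make the observed behavior concrete, here is the same sequence as shell commands (the fragment IDs are just the ones from this example):

```bash
# Rescheduling all eight fragments in one command can OOM the compute node:
risectl scale vertical --fragments 1,2,3,4,5,6,7,8

# Splitting the same fragments into two smaller batches succeeds:
risectl scale vertical --fragments 1,2,3,4
risectl scale vertical --fragments 5,6,7,8
```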

I think the only additional memory required is for the new actors that have not started yet, which should be small. And since the source is paused during rescheduling, the memory footprint of the old actors should remain stable. Also, in my case there are no source executors involved in the rescheduling.

Here are some numbers from my case: a compute node uses 100 GB of memory before rescheduling; during rescheduling it increases to 200 GB until OOM.

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

zwang28 · Dec 04 '23 02:12

> I think the only additional memory required is for the new actors that have not started yet, which should be small. And since the source is paused during rescheduling, the memory footprint of the old actors should remain stable. Also, in my case there are no source executors involved in the rescheduling.

Agree. The scaling process only changes actors and metadata and does not involve data, so it's weird that the memory usage doubled.

Thinking about how to reproduce this... How about running a scaling operation during a longevity test?

fuyufjh · Dec 05 '23 03:12

Marked as priority/high because we have encountered this twice in the past week.

fuyufjh · Dec 05 '23 03:12

It appears that I cannot easily reproduce it on my machine. I attempted to scale out a nexmark_q7 job from parallelism 2 to 20, involving 90 new actors and 1200 local exchange channels, and found only about 20 MB of memory usage.

Approximate observations (these roughly add up to the ~20 MB above):

  • ~2 MB: channels
  • ~2 MB: update actors
  • ~2 MB: tracing spans
  • ~2 MB: vnode mapping
  • ~10 MB: executor streams

BugenZhao · Dec 07 '23 06:12

With remote exchange, tonic decoding and prost encoding may occupy more memory, but the value still looks reasonable, and the allocations are mostly persistent:

I scaled out a nexmark_q7 job from 2x3 parallelism to 20x3, involving 270 new actors and ~11000 exchange channels, and found ~400 MB of memory usage.

BugenZhao · Dec 07 '23 06:12

Since I didn't find a large memory footprint in my local reproduction, and most of the allocations there are persistent, I'm wondering...

  • whether there are too many fragments or the parallelism is quite large in your case,
  • or whether the query pattern matters, e.g., there's a bug when some executor initializes.

BugenZhao · Dec 07 '23 06:12

I haven't had the chance to get heap files from the original case. Will get one next time.

zwang28 · Dec 19 '23 08:12

Finally, we have a heap file. Of the compute node's 150 GB memory footprint, 30 GB is storage cache, which is expected. The other two parts, which add up to 100 GB, are unexpected. The source is paused. @shanicky, the RW version is 1.6.

2024-02-24-14-01-53.manual.heap.collapsed.gz
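For anyone looking at the attachment: the file appears to be a gzipped collapsed-stack heap profile, so it should render into a flamegraph with any folded-stack tool. A minimal sketch assuming the inferno tooling is available (an assumption about tooling, not something stated in this report):

```bash
# Sketch: turn the collapsed heap profile into an SVG flamegraph.
# Assumes the `inferno` crate's binaries are installed; flamegraph.pl
# from the FlameGraph repo consumes the same folded format.
cargo install inferno
zcat 2024-02-24-14-01-53.manual.heap.collapsed.gz \
  | inferno-flamegraph > heap-flamegraph.svg
```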

zwang28 · Feb 24 '24 14:02

related to https://github.com/risingwavelabs/risingwave/issues/14533

zwang28 · Feb 26 '24 06:02