
bug: reschedule has much larger memory footprint than expected

Open · zwang28 opened this issue 1 year ago • 8 comments

Describe the bug

I've encountered several OOMs when running risectl scale vertical with lots of fragments included. For example, risectl scale vertical --fragments 1,2,3,4,5,6,7,8 can result in a compute node OOM and a subsequent reschedule failure. However, risectl scale vertical --fragments 1,2,3,4 followed by risectl scale vertical --fragments 5,6,7,8 can succeed.
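To make the observed behavior concrete, here is the same sequence as shell commands (the fragment IDs are just the ones from this example):

```bash
# Rescheduling all eight fragments in one command can OOM the compute node:
risectl scale vertical --fragments 1,2,3,4,5,6,7,8

# Splitting the same fragments into two smaller batches succeeds:
risectl scale vertical --fragments 1,2,3,4
risectl scale vertical --fragments 5,6,7,8
```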

I think the only additional memory required is for the new actors that have not started yet, which should be small. And since the source is paused during rescheduling, the memory footprint of the old actors should remain stable. Also, in my case there are no source executors involved in the rescheduling.

Here are some numbers from my case: a compute node uses 100 GB of memory before rescheduling; during rescheduling it increases to 200 GB until OOM.

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

zwang28 · Dec 04 '23 02:12

> I think the only additional memory required is for the new actors that have not started yet, which should be small. And since the source is paused during rescheduling, the memory footprint of the old actors should remain stable. Also, in my case there are no source executors involved in the rescheduling.

Agree. The scaling process only changes actors and metadata and does not involve data, so it's weird that the memory usage doubled.

Thinking about how to reproduce this... How about running a scaling operation during a longevity test?

fuyufjh · Dec 05 '23 03:12

Marked as priority/high because we have encountered this twice in the past week.

fuyufjh · Dec 05 '23 03:12

It appears that I cannot easily reproduce it on my machine. I attempted to scale out a nexmark_q7 job from parallelism 2 to 20, involving 90 new actors and 1200 local exchange channels, and found only about 20 MB of memory usage.

Approximate observations (these roughly add up to the ~20 MB above):

  • ~2 MB: channels
  • ~2 MB: update actors
  • ~2 MB: tracing spans
  • ~2 MB: vnode mapping
  • ~10 MB: executor streams

BugenZhao · Dec 07 '23 06:12

With remote exchange, tonic decoding and prost encoding may occupy more memory, but the value still looks reasonable, and the allocations are mostly persistent:

I scaled out a nexmark_q7 job from 2x3 parallelism to 20x3, involving 270 new actors and ~11000 exchange channels, and found ~400 MB of memory usage.

BugenZhao · Dec 07 '23 06:12

Since I didn't find a large memory footprint in my local reproduction, and most of the allocations there are persistent, I'm wondering...

  • whether there are too many fragments or the parallelism is quite large in your case,
  • or whether the query pattern matters, e.g., there's a bug when some executor initializes.

BugenZhao · Dec 07 '23 06:12

I haven't had the chance to get heap files from the original case. Will get one next time.

zwang28 · Dec 19 '23 08:12

Finally, we have a heap file. Of the compute node's 150 GB memory footprint, 30 GB is storage cache, which is expected. The other two parts, which add up to 100 GB, are unexpected. The source is paused. @shanicky, the RW version is 1.6.

2024-02-24-14-01-53.manual.heap.collapsed.gz
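For anyone looking at the attachment: the file appears to be a gzipped collapsed-stack heap profile, so it should render into a flamegraph with any folded-stack tool. A minimal sketch assuming the inferno tooling is available (an assumption about tooling, not something stated in this report):

```bash
# Sketch: turn the collapsed heap profile into an SVG flamegraph.
# Assumes the `inferno` crate's binaries are installed; flamegraph.pl
# from the FlameGraph repo consumes the same folded format.
cargo install inferno
zcat 2024-02-24-14-01-53.manual.heap.collapsed.gz \
  | inferno-flamegraph > heap-flamegraph.svg
```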

zwang28 · Feb 24 '24 14:02

related to https://github.com/risingwavelabs/risingwave/issues/14533

zwang28 · Feb 26 '24 06:02