
[Improvement] Limit the speed of memory release when drop pending events

Open • xianjingfeng opened this issue 2 years ago • 11 comments

In org.apache.uniffle.server.ShuffleFlushManager#processPendingEvents, an OOM can occur when a large number of events need to be dropped: usedMemory is released immediately, but GC cannot reclaim the dropped events' objects fast enough.
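The improvement in the title is to pace the release of accounted memory instead of releasing it all at once. Below is a minimal sketch of that idea, assuming a shared AtomicLong stands in for usedMemory and a fixed bytes-per-second budget. The class name ThrottledReleaser, the method dropAll, and the maxBytesPerSecond parameter are hypothetical illustrations, not Uniffle's actual code or API.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch (not Uniffle code): hands back the accounted memory of
 * dropped events no faster than maxBytesPerSecond, so the usedMemory counter
 * does not run far ahead of what the GC has actually reclaimed.
 */
class ThrottledReleaser {
    private final AtomicLong usedMemory;  // stand-in for the server's usedMemory counter
    private final long maxBytesPerSecond; // assumed release budget (a tuning knob)

    ThrottledReleaser(AtomicLong usedMemory, long maxBytesPerSecond) {
        if (maxBytesPerSecond <= 0) {
            throw new IllegalArgumentException("maxBytesPerSecond must be positive");
        }
        this.usedMemory = usedMemory;
        this.maxBytesPerSecond = maxBytesPerSecond;
    }

    /** Releases each dropped event's size, sleeping as needed; returns total bytes released. */
    long dropAll(List<Long> droppedEventSizes) throws InterruptedException {
        long start = System.nanoTime();
        long released = 0;
        for (long size : droppedEventSizes) {
            released += size;
            // Earliest elapsed time at which `released` bytes may have been handed back.
            long targetNanos = (long) ((double) released / maxBytesPerSecond * 1_000_000_000L);
            long waitNanos = targetNanos - (System.nanoTime() - start);
            if (waitNanos > 0) {
                Thread.sleep(waitNanos / 1_000_000L, (int) (waitNanos % 1_000_000L));
            }
            usedMemory.addAndGet(-size);
        }
        return released;
    }
}
```

With a budget of 10,000 bytes/s, dropping three 1,000-byte events takes roughly 0.3 s instead of returning immediately. The trade-off is that the dropped events keep counting against usedMemory for longer; whether that actually prevents the OOM is exactly what the discussion below questions.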

xianjingfeng • Aug 18 '22 08:08

I'm curious how you determined that the OOM was caused by the dropped events.

zuston • Aug 19 '22 09:08

(screenshot 1)

xianjingfeng • Aug 19 '22 09:08

(screenshot 2)

xianjingfeng • Aug 19 '22 09:08

https://stackoverflow.com/questions/8719071/is-it-possible-to-get-outofmemoryerror-because-garbage-collection-too-slow

I'm not sure about the effect of this PR.

jerqi • Aug 19 '22 09:08

The purpose of this PR is to give GC more time and to slow down the rate at which data is received. It may not be the best approach, but I think it is effective.
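The "slow down receiving data" half of the argument relies on back-pressure: as long as released memory is still counted as used, fewer new write requests fit under the capacity. A minimal sketch of such an admission check follows, assuming a single AtomicLong counter and a fixed capacity. MemoryAdmission, tryRequire, and release are hypothetical names for illustration, not Uniffle's ShuffleBufferManager API.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical admission check (not Uniffle code): a write is accepted only if
 * the accounted memory plus the request stays within capacity. While dropped
 * events are released slowly, usedMemory stays high and more requests are
 * rejected, which slows the rate at which new data is received.
 */
class MemoryAdmission {
    private final AtomicLong usedMemory; // stand-in for the server's usedMemory counter
    private final long capacity;         // assumed memory budget in bytes

    MemoryAdmission(AtomicLong usedMemory, long capacity) {
        this.usedMemory = usedMemory;
        this.capacity = capacity;
    }

    /** Atomically reserves size bytes if they fit; returns false (client should retry later) otherwise. */
    boolean tryRequire(long size) {
        while (true) {
            long current = usedMemory.get();
            if (current + size > capacity) {
                return false; // not enough headroom: reject and let the client back off
            }
            if (usedMemory.compareAndSet(current, current + size)) {
                return true;
            }
            // Another thread changed the counter concurrently; re-read and retry.
        }
    }

    /** Returns previously reserved bytes to the pool. */
    void release(long size) {
        usedMemory.addAndGet(-size);
    }
}
```

For example, with 900 of 1,000 bytes in use, a 200-byte request is rejected; after 300 bytes are released, the same request succeeds and usage becomes 800.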

xianjingfeng • Aug 19 '22 10:08

As I understand it, when memory runs low, the JVM first triggers a GC to remove unused objects, and this stops the world.

Only if memory is still insufficient after that does it throw an OOM.

What do you think?

zuston • Aug 19 '22 13:08

When is GC triggered? The direct cause of the GC pressure is that data is written too fast. This may be solved once gRPC is replaced with Netty in the future.

xianjingfeng • Aug 19 '22 13:08

@zuston I misunderstood what you said earlier. You are right, but a GC cycle does not necessarily reclaim all unused objects at once.

xianjingfeng • Aug 19 '22 13:08

Could you explain the two diagrams? I don't follow your point.

jerqi • Aug 22 '22 03:08

When total_dropped_event_num rises rapidly, jvm_memory_bytes_used rises rapidly at the same time. Doesn't that mean the memory growth is triggered by processing pending events?

xianjingfeng • Aug 22 '22 06:08

Hmm... If we can't write data to storage, we will occupy more memory. That's normal. But the two diagrams don't convince me that the GC speed led to the OOM.

jerqi • Aug 22 '22 14:08