incubator-uniffle
[Improvement] Disallow sendShuffleData if requireBufferId expired
We found that a shuffle server under high load easily encounters java.lang.OutOfMemoryError: Java heap space, even when we allocate more JVM heap memory and lower rss.server.buffer.capacity.
The steps that lead to the exception above:
- When the shuffle server is under high load, a requireBufferId expires easily, and the shuffle server releases the corresponding usedMemory.
- The client calls sendShuffleData using the expired requireBufferId.
- The shuffle server receives the shuffle data and stores it in the RPC queue (this memory usage is not added to usedMemory).
- Other clients' requireBuffer calls succeed because usedMemory appears to have enough room.
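The steps above can be replayed in a small toy model. This is only a sketch of the accounting gap, not Uniffle's actual code: the class BufferAccountingSketch, its fields, and the capacity of 100 bytes are all hypothetical, and only the names requireBuffer, sendShuffleData, requireBufferId, and usedMemory come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the accounting gap. All names are hypothetical, not Uniffle's classes.
class BufferAccountingSketch {
    final long capacity;
    long usedMemory = 0;     // what the server believes is in use
    long rpcQueueBytes = 0;  // bytes held in the RPC queue, never counted in usedMemory
    long nextId = 1;
    final Map<Long, Long> pending = new HashMap<>(); // requireBufferId -> reserved bytes

    BufferAccountingSketch(long capacity) {
        this.capacity = capacity;
    }

    // Reserves memory and returns a requireBufferId, or -1 if there is no room.
    long requireBuffer(long size) {
        if (usedMemory + size > capacity) {
            return -1;
        }
        usedMemory += size;
        pending.put(nextId, size);
        return nextId++;
    }

    // Expiry drops the reservation and releases its usedMemory.
    void expire(long requireBufferId) {
        Long size = pending.remove(requireBufferId);
        if (size != null) {
            usedMemory -= size;
        }
    }

    // Assumed old behaviour: data is accepted even if the id is no longer known.
    void sendShuffleData(long requireBufferId, long size) {
        if (pending.remove(requireBufferId) == null) {
            rpcQueueBytes += size; // held in memory but invisible to usedMemory
        }
        // Otherwise the bytes were already counted when the buffer was required.
    }

    long actualBytesHeld() {
        return usedMemory + rpcQueueBytes;
    }

    public static void main(String[] args) {
        BufferAccountingSketch server = new BufferAccountingSketch(100);
        long a = server.requireBuffer(80);   // client A reserves 80
        server.expire(a);                    // the id expires under load
        server.sendShuffleData(a, 80);       // A sends anyway with the expired id
        long b = server.requireBuffer(80);   // client B succeeds, usedMemory was 0
        server.sendShuffleData(b, 80);
        System.out.println("usedMemory=" + server.usedMemory
            + " actual=" + server.actualBytesHeld() + " capacity=" + server.capacity);
    }
}
```

In this model the server ends up holding 160 bytes while believing it holds 80, against a capacity of 100, which is the kind of gap that can turn into a heap OOM on a real server.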
Do you have a solution to the problem?
Yes, it is being tested in our production environment. I will watch it for a while. If it's OK, I will create a PR.
Could you share your solution? We can discuss first.
- On the server side, if the requireBufferId is not found when data is sent, throw an exception.
- On the client side, if sending data fails, require the buffer again.
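A minimal sketch of the two points above, assuming a simplified in-memory server. The class GuardedSendSketch, the nested Server class, the sendWithRetry helper, and the use of IllegalStateException are all hypothetical choices for illustration; the actual change in the project may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed behaviour. All names are hypothetical, not Uniffle's classes.
class GuardedSendSketch {

    static class Server {
        final long capacity;
        long usedMemory = 0;
        long nextId = 1;
        final Map<Long, Long> pending = new HashMap<>(); // requireBufferId -> reserved bytes

        Server(long capacity) {
            this.capacity = capacity;
        }

        // Reserves memory and returns a requireBufferId, or -1 if there is no room.
        long requireBuffer(long size) {
            if (usedMemory + size > capacity) {
                return -1;
            }
            usedMemory += size;
            pending.put(nextId, size);
            return nextId++;
        }

        // Expiry drops the reservation and releases its usedMemory.
        void expire(long requireBufferId) {
            Long size = pending.remove(requireBufferId);
            if (size != null) {
                usedMemory -= size;
            }
        }

        // Server side: reject data whose requireBufferId is unknown or expired.
        void sendShuffleData(long requireBufferId, long size) {
            if (pending.remove(requireBufferId) == null) {
                throw new IllegalStateException(
                    "requireBufferId " + requireBufferId + " not found or expired");
            }
            // Accepted: these bytes are already counted in usedMemory.
        }
    }

    // Client side: on rejection, require a buffer again and resend.
    // Returns the number of send attempts that were needed.
    static int sendWithRetry(Server server, long requireBufferId, long size, int maxAttempts) {
        long id = requireBufferId;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                server.sendShuffleData(id, size);
                return attempt;
            } catch (IllegalStateException e) {
                if (attempt == maxAttempts) {
                    throw e; // give up without reserving a buffer that would go unused
                }
                id = server.requireBuffer(size); // -1 here is simply rejected on the next send
            }
        }
        throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
}
```

With this guard, data that arrives under an expired id never occupies memory outside of usedMemory, so a later requireBuffer from another client sees an accurate picture. A real client would probably also back off between attempts, which the sketch leaves out.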
cc @colinmjj. There seem to be no such cases in our production environment, but I think the analysis is correct. What do you think?
I think @xianjingfeng is right. With the current implementation, OOM will happen if the requireBufferId has already expired in the shuffle server. This may be caused by GC, network problems, high workload on the shuffle server, etc.
It's better to add a restriction so that data is accepted only with a valid requireBufferId, to avoid this problem.
closed by #157