incubator-uniffle

[Improvement] Disallow sendShuffleData if requireBufferId expired

Open xianjingfeng opened this issue 3 years ago • 6 comments

We found that a shuffle server under high load easily runs into java.lang.OutOfMemoryError: Java heap space, even when we allocate more JVM heap memory and lower rss.server.buffer.capacity.

The steps that lead to the exception above:

  1. When the shuffle server is under high load, a requireBufferId easily expires, and the shuffle server releases the corresponding usedMemory.
  2. The client calls sendShuffleData with the expired requireBufferId.
  3. The shuffle server receives the shuffle data and holds it in the RPC queue (this memory is not counted in usedMemory).
  4. Other clients' requireBuffer calls succeed because usedMemory still appears to have enough room.
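
The four steps above can be reproduced with a small model of the accounting. This is only an illustrative sketch: `LeakyBufferModel` and its methods are hypothetical names chosen for this example, not Uniffle's actual classes, and the RPC queue is reduced to a byte counter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the accounting gap; not Uniffle's real implementation.
class LeakyBufferModel {
    private final long capacity;
    private long usedMemory = 0;   // what the server believes is in use
    private long queuedBytes = 0;  // bytes actually held in the RPC queue
    private long nextId = 0;
    private final Map<Long, Long> preAllocated = new HashMap<>();

    LeakyBufferModel(long capacity) {
        this.capacity = capacity;
    }

    // Reserves memory and returns a requireBufferId, or -1 if usedMemory says there is no room.
    long requireBuffer(long size) {
        if (usedMemory + size > capacity) {
            return -1;
        }
        usedMemory += size;
        preAllocated.put(nextId, size);
        return nextId++;
    }

    // Step 1: expiry releases the reservation from usedMemory.
    void expire(long id) {
        Long size = preAllocated.remove(id);
        if (size != null) {
            usedMemory -= size;
        }
    }

    // Steps 2 and 3: the id is never validated, so data sent with an
    // expired id is queued without being counted in usedMemory.
    void sendShuffleData(long id, long size) {
        queuedBytes += size;
    }

    long usedMemory() {
        return usedMemory;
    }

    long queuedBytes() {
        return queuedBytes;
    }
}
```

With a capacity of 100, reserving 80, letting the id expire, sending the 80 bytes anyway, and then reserving and sending another 80 from a second client leaves usedMemory at 80 while 160 bytes are actually queued, which is the overshoot described in step 4.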

xianjingfeng avatar Jul 26 '22 10:07 xianjingfeng

Do you have the solution of the problem?

jerqi avatar Jul 27 '22 03:07 jerqi

Yes, it is being tested in our production environment. I will watch it for a while. If it's OK, I will create a PR.

xianjingfeng avatar Jul 27 '22 12:07 xianjingfeng

Could you share your solution? We can discuss first.

jerqi avatar Jul 28 '22 03:07 jerqi

> Could you share your solution? We can discuss first.

  1. On the server side, if the requireBufferId is not found when data is sent, throw an exception.
  2. On the client side, if sending data fails, require a buffer again.
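
A minimal sketch of these two changes, under the assumption that the server tracks reservations in a map keyed by requireBufferId. `GuardedBufferModel`, `RetryingClient`, the choice of `IllegalStateException`, and the `maxAttempts` parameter are all hypothetical and only illustrate the idea; the actual change is the one in the PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical server-side model: data is accepted only for a live requireBufferId.
class GuardedBufferModel {
    private final long capacity;
    private long usedMemory = 0;
    private long nextId = 0;
    private final Map<Long, Long> preAllocated = new HashMap<>();

    GuardedBufferModel(long capacity) {
        this.capacity = capacity;
    }

    // Reserves memory and returns a requireBufferId, or -1 if there is no room.
    long requireBuffer(long size) {
        if (usedMemory + size > capacity) {
            return -1;
        }
        usedMemory += size;
        preAllocated.put(nextId, size);
        return nextId++;
    }

    // Expiry releases the reservation from usedMemory.
    void expire(long id) {
        Long size = preAllocated.remove(id);
        if (size != null) {
            usedMemory -= size;
        }
    }

    // Server side: reject data whose requireBufferId is unknown or expired.
    void sendShuffleData(long id, long size) {
        if (preAllocated.remove(id) == null) {
            throw new IllegalStateException("requireBufferId " + id + " not found or expired");
        }
        // The reservation becomes real usage, so usedMemory stays unchanged.
    }

    long usedMemory() {
        return usedMemory;
    }
}

// Hypothetical client-side helper: on failure, require a buffer again and resend.
class RetryingClient {
    static boolean send(GuardedBufferModel server, long id, long size, int maxAttempts) {
        long current = id;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (current >= 0) {
                try {
                    server.sendShuffleData(current, size);
                    return true;
                } catch (IllegalStateException e) {
                    // The id expired on the server; fall through and re-require.
                }
            }
            if (attempt + 1 < maxAttempts) {
                current = server.requireBuffer(size);
            }
        }
        return false;
    }
}
```

Because rejected data never reaches the queue, every byte the server holds is backed by a reservation in usedMemory, and a later requireBuffer fails when the capacity is really exhausted.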

xianjingfeng avatar Jul 29 '22 13:07 xianjingfeng

cc @colinmjj. We don't seem to have hit this case in our production environment, but I think the analysis is correct. What do you think?

jerqi avatar Jul 31 '22 14:07 jerqi

I think @xianjingfeng is right. With the current implementation, OOM will happen if the requireBufferId has already expired in the shuffle server; this may be caused by GC, network problems, high workload on the shuffle server, etc. It's better to accept data only when it comes with a valid requireBufferId, to avoid this problem.

colinmjj avatar Aug 01 '22 02:08 colinmjj

closed by #157

jerqi avatar Aug 22 '22 12:08 jerqi