
[Problem] The shuffle server memory is not released

Open wfxxh opened this issue 2 years ago • 11 comments

Uniffle version: 0.6.0. I deployed it on k8s, but the shuffle server memory is not released; even after my Spark application stops, it is still not released. My server config is below:

rss.rpc.server.port 19999
rss.jetty.http.port 19998
rss.rpc.executor.size 2000
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.coordinator.quorum 10.100.41.162:19999
rss.server.disk.capacity 50g
rss.storage.basePath /home/data
rss.server.flush.thread.alive 1
rss.server.flush.threadPool.size 10
rss.server.buffer.capacity 4g
rss.server.read.buffer.capacity 2g
rss.server.heartbeat.timeout 60000
rss.server.heartbeat.interval 10000
rss.rpc.message.max.size 1073741824
rss.server.preAllocation.expired 120000
rss.server.commit.timeout 600000
rss.server.app.expired.withoutHeartbeat 120000
rss.server.flush.cold.storage.threshold.size 128m

wfxxh avatar Sep 20 '22 09:09 wfxxh
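
As a rough aid for reading the configuration above, the following sketch parses the size-valued settings and adds up the two buffer capacities. It assumes (this is not confirmed in the thread) that both buffers are served from the JVM heap, so their sum is compared against the two `XMX_SIZE` values mentioned later in the discussion. The helper names `parse_size` and `parse_conf` are hypothetical and not part of Uniffle.

```python
import re

# Size-related settings copied from the reported server config.
SERVER_CONF = """
rss.server.disk.capacity 50g
rss.server.buffer.capacity 4g
rss.server.read.buffer.capacity 2g
rss.rpc.message.max.size 1073741824
rss.server.flush.cold.storage.threshold.size 128m
"""

GIB = 1024**3
UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": GIB}


def parse_size(value: str) -> int:
    """Convert '4g', '128m' or a plain byte count into bytes."""
    match = re.fullmatch(r"(\d+)([kmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


def parse_conf(text: str) -> dict:
    """Parse whitespace-separated 'key value' lines into a dict."""
    conf = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            conf[parts[0]] = parts[1]
    return conf


conf = parse_conf(SERVER_CONF)
buffers = (
    parse_size(conf["rss.server.buffer.capacity"])
    + parse_size(conf["rss.server.read.buffer.capacity"])
)
print(f"configured buffers: {buffers // GIB} GiB")

# Assumption: buffers live on the heap, so compare them with each heap size.
for xmx_gib in (30, 8):
    share = buffers / (xmx_gib * GIB)
    print(f"XMX_SIZE={xmx_gib}G -> buffers take {share:.0%} of the heap")
```

Under that assumption, the configured buffers alone would account for a much larger share of an 8G heap than of a 30G heap, which may be worth keeping in mind when lowering `XMX_SIZE` without also lowering the buffer capacities.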

The JVM can keep holding memory even when it is not processing any data.

jerqi avatar Sep 21 '22 02:09 jerqi

But when the memory is full, the shuffle server pod restarts, and in that case my Spark application fails.

wfxxh avatar Sep 21 '22 02:09 wfxxh

Why does the shuffle server restart? There should be some information in the logs or stdout.

jerqi avatar Sep 21 '22 02:09 jerqi

It is restarted by k8s; the reason is that memory usage is too high. I think this would not happen if the memory were released.

wfxxh avatar Sep 21 '22 02:09 wfxxh

It is restarted by k8s; the reason is that memory usage is too high. I think this would not happen if the memory were released.

Maybe we should give more memory to the pod.

jerqi avatar Sep 21 '22 03:09 jerqi

It is 32G now; I cannot give it more.

wfxxh avatar Sep 21 '22 05:09 wfxxh

You can adjust the memory parameters in bin/rss-env.sh and conf/server.conf.

jerqi avatar Sep 21 '22 06:09 jerqi

XMX_SIZE? It is 30G now.

wfxxh avatar Sep 21 '22 06:09 wfxxh

XMX_SIZE? It is 30G now.

Could you reduce the value?

jerqi avatar Sep 21 '22 07:09 jerqi

I have reduced it to 8G, but the pod still restarts.

wfxxh avatar Sep 21 '22 07:09 wfxxh

I have reduced it to 8G, but the pod still restarts.

Does the server restart for the same reason? You give the pod 32G of memory and XMX_SIZE is 8G, is that right?

jerqi avatar Sep 21 '22 09:09 jerqi
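
The numbers discussed in this thread can be put side by side with a little arithmetic. A JVM process uses memory beyond its heap (metaspace, thread stacks, direct buffers and so on), so the pod limit has to cover the heap plus that non-heap part. The sketch below only computes the headroom left for non-heap memory under each `XMX_SIZE` mentioned above; it does not diagnose why the pod restarts, and the function name `non_heap_headroom` is hypothetical.

```python
GIB = 1024**3


def non_heap_headroom(pod_limit: int, xmx: int) -> int:
    """Bytes left inside the pod limit once the heap has grown to its maximum."""
    return pod_limit - xmx


POD_LIMIT = 32 * GIB  # pod memory reported in the thread

for xmx_gib in (30, 8):  # the two XMX_SIZE values tried in the thread
    headroom = non_heap_headroom(POD_LIMIT, xmx_gib * GIB)
    print(f"XMX_SIZE={xmx_gib}G -> {headroom // GIB} GiB left for non-heap memory")
```

With a 30G heap only 2 GiB remains for everything outside the heap, while an 8G heap leaves 24 GiB. If the pod is still being restarted at 8G, the logs and the restart reason reported by k8s are the place to check whether the cause is really the same.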