dolphinscheduler
[Bug] [Worker] RemoteShell tasks may have memory leak issue
Search before asking
- [X] I had searched in the issues and found no similar issues.
What happened
Version: 3.2.0
JVM options: -Xmx3g -Xms3g -Xmn1g
JDK: amazon-corretto-11.0.19.7.1-linux-x86_64
My project has 60 scheduled RemoteShell tasks (executing PHP commands). After running for a while, Full GCs become frequent, all tasks fail, and the worker nodes end up in what looks like a deadlock. As a workaround, I had to change the RemoteShell tasks to Shell tasks whose command uses ssh -i id_rsa ''.
Apart from the error logs, I also noticed a WARN log with an NPE (NullPointerException) every time a task is executed:
[WARN] 2024-04-10 04:01:27.782 +0800 org.apache.sshd.client.session.ClientSessionImpl:[618] - [WorkflowInstance-0][TaskInstance-0] - exceptionCaught(ClientSessionImpl[root@/172.19.23.121:22])[state=Opened] NullPointerException: No customized heartbeat handler registered
Here is the error log:
[ERROR] 2024-04-10 04:01:01.146 +0800 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[181] - [WorkflowInstance-72475][TaskInstance-74145] - Task execute failed, due to meet an exception
org.apache.dolphinscheduler.plugin.task.api.TaskException: Execute shell task error
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.handle(RemoteShellTask.java:110)
at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerDelayTaskExecuteRunnable.executeTask(DefaultWorkerDelayTaskExecuteRunnable.java:57)
at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.run(WorkerTaskExecuteRunnable.java:175)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: Remote shell task error
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.run(RemoteExecutor.java:101)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.handle(RemoteShellTask.java:104)
... 9 common frames omitted
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: SSH connection failed
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:83)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.runRemote(RemoteExecutor.java:208)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getTaskPid(RemoteExecutor.java:184)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.run(RemoteExecutor.java:91)
... 10 common frames omitted
Caused by: org.apache.sshd.common.SshException: DefaultConnectFuture[root@/172.19.23.121:22]: Failed to get operation result within specified timeout: 5000
at org.apache.sshd.common.future.AbstractSshFuture.formatExceptionMessage(AbstractSshFuture.java:185)
at org.apache.sshd.common.future.AbstractSshFuture.verifyResult(AbstractSshFuture.java:111)
at org.apache.sshd.client.future.DefaultConnectFuture.verify(DefaultConnectFuture.java:42)
at org.apache.sshd.client.future.DefaultConnectFuture.verify(DefaultConnectFuture.java:34)
at org.apache.dolphinscheduler.plugin.datasource.ssh.SSHUtils.getSession(SSHUtils.java:42)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:78)
... 13 common frames omitted
[INFO] 2024-04-10 04:01:02.874 +0800 org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask:[118] - [WorkflowInstance-72475][TaskInstance-74145] - kill remote task dolphinscheduler-remoteshell-74145
[ERROR] 2024-04-10 04:01:02.875 +0800 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[140] - [WorkflowInstance-72475][TaskInstance-74145] - Cancel task failed, this will not affect the taskInstance status, but you need to check manual
org.apache.dolphinscheduler.plugin.task.api.TaskException: cancel application error
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.cancel(RemoteShellTask.java:121)
at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.cancelTask(WorkerTaskExecuteRunnable.java:136)
at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.afterThrowing(WorkerTaskExecuteRunnable.java:118)
at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerDelayTaskExecuteRunnable.afterThrowing(DefaultWorkerDelayTaskExecuteRunnable.java:67)
at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.run(WorkerTaskExecuteRunnable.java:182)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: SSH connection failed
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:83)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.runRemote(RemoteExecutor.java:208)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getTaskPid(RemoteExecutor.java:184)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.kill(RemoteExecutor.java:176)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.cancel(RemoteShellTask.java:119)
... 11 common frames omitted
Caused by: java.lang.IllegalStateException: SshClient not started. Please call start() method before connecting to a server
at org.apache.sshd.client.SshClient.doConnect(SshClient.java:627)
at org.apache.sshd.client.SshClient.doConnect(SshClient.java:616)
at org.apache.sshd.client.SshClient.connect(SshClient.java:547)
at org.apache.sshd.client.SshClient.connect(SshClient.java:539)
at org.apache.sshd.client.session.ClientSessionCreator.connect(ClientSessionCreator.java:74)
at org.apache.sshd.client.session.ClientSessionCreator.connect(ClientSessionCreator.java:57)
at org.apache.dolphinscheduler.plugin.datasource.ssh.SSHUtils.getSession(SSHUtils.java:41)
at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:78)
... 15 common frames omitted
Here is a snapshot of the host's memory:
What you expected to happen
RemoteShell tasks should execute without leaking memory.
How to reproduce
Create RemoteShell tasks and schedule them to run at short intervals.
Anything else
No response
Version
3.2.x
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Should be fixed by https://github.com/apache/dolphinscheduler/pull/15348
Should be finished by #15348
I had already applied the fix you mentioned locally. The memory leak still exists.
Have you closed the RemoteExecutor?
// add task close method to release resource
try (RemoteExecutor executor = remoteExecutor) {
executor
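For readers unfamiliar with the pattern the snippet above refers to, here is a minimal sketch of try-with-resources releasing a resource even when the task body throws. `DemoExecutor` and `CloseDemo` are hypothetical stand-ins that only count open instances; they are not the real `RemoteExecutor` and make no SSH connection.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for an executor that owns a connection-like resource.
 * It is NOT the real RemoteExecutor; it only counts open instances.
 */
final class DemoExecutor implements AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();

    DemoExecutor() {
        OPEN.incrementAndGet(); // resource acquired
    }

    String run(String command) {
        if (command.isEmpty()) {
            throw new IllegalArgumentException("empty command");
        }
        return "ran: " + command;
    }

    @Override
    public void close() {
        OPEN.decrementAndGet(); // resource released, even if run() threw
    }
}

final class CloseDemo {
    /** Runs one command and always releases the executor afterwards. */
    static String runOnce(String command) {
        try (DemoExecutor executor = new DemoExecutor()) {
            return executor.run(command);
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnce("php job.php"));
        try {
            runOnce(""); // fails inside run()
        } catch (IllegalArgumentException expected) {
            // close() has already been called by the time we get here
        }
        System.out.println("open executors: " + DemoExecutor.OPEN.get());
    }
}
```

If the open count keeps growing across task runs in a real worker, some code path is skipping `close()`; if it stays flat and memory still grows, the leak is probably elsewhere.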
Yes, I have.
Here is the code:
@simmonn Could you please provide the heap dump file or stack info? Maybe this is caused by thread leak.
Please try to change the heartbeatType to IGNORE
session.setSessionHeartbeat(SessionHeartbeatController.HeartbeatType.IGNORE, Duration.ofSeconds(3));
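To illustrate why the heartbeat type might matter: the WARN log in the report says "No customized heartbeat handler registered", which suggests the session was configured with a heartbeat type that needs a custom handler. The sketch below is a toy model under that assumption. `HeartbeatModel`, `ReservedHandler`, and the branch logic are hypothetical and are not taken from the SSH library's source; the model only shows how a handler-free type such as IGNORE would avoid the NPE.

```java
import java.util.Objects;

/** Toy model of heartbeat dispatch; an assumption-based sketch, not library code. */
final class HeartbeatModel {
    enum HeartbeatType { NONE, IGNORE, RESERVED }

    /** Hypothetical callback that a RESERVED heartbeat would delegate to. */
    interface ReservedHandler {
        boolean sendReservedHeartbeat();
    }

    private final HeartbeatType type;
    private final ReservedHandler handler; // may be null if nothing was registered

    HeartbeatModel(HeartbeatType type, ReservedHandler handler) {
        this.type = type;
        this.handler = handler;
    }

    /** Returns true if a heartbeat was sent; throws NPE if RESERVED has no handler. */
    boolean sendHeartbeat() {
        switch (type) {
            case NONE:
                return false; // heartbeats disabled
            case IGNORE:
                return true; // self-contained keep-alive, no handler needed
            case RESERVED:
                return Objects.requireNonNull(handler,
                        "No customized heartbeat handler registered").sendReservedHeartbeat();
            default:
                throw new IllegalStateException("unknown type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println("IGNORE -> " + new HeartbeatModel(HeartbeatType.IGNORE, null).sendHeartbeat());
        try {
            new HeartbeatModel(HeartbeatType.RESERVED, null).sendHeartbeat();
        } catch (NullPointerException e) {
            System.out.println("RESERVED -> NullPointerException: " + e.getMessage());
        }
    }
}
```

In this model the failure repeats on every heartbeat tick, which would match a warning that appears on every task execution; whether the real client behaves this way should be checked against its source.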
I reproduced it in the test environment. Executing the SSH command via a Shell task works fine. However, when executing tasks via RemoteShell, the tasks also get stuck after a while. Here is the stack info: stack_417.jstack.gz
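When checking for a thread leak like the one suspected above, it can help to group live threads by name prefix and watch which group grows between samples. The following is a minimal sketch using only the JDK; `ThreadLeakProbe` and the prefix rule (strip trailing digits, hyphens, and `#`) are illustrative choices, not part of DolphinScheduler, and the probe only sees threads in the JVM it runs in.

```java
import java.util.Map;
import java.util.TreeMap;

/** Groups live threads by name prefix so a growing group stands out. */
final class ThreadLeakProbe {

    /** Strips a trailing run of digits, hyphens and '#', e.g. "pool-3-thread-12" -> "pool-3-thread". */
    static String prefixOf(String threadName) {
        return threadName.replaceAll("[\\d\\-#]+$", "");
    }

    /** Counts the live threads of the current JVM per name prefix. */
    static Map<String, Integer> countByPrefix() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(prefixOf(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print "count<TAB>prefix" for every group of live threads.
        countByPrefix().forEach((prefix, n) -> System.out.println(n + "\t" + prefix));
    }
}
```

Sampling this a few minutes apart while tasks run, or applying the same grouping to thread names in a `jstack` dump, should show whether one family of threads keeps accumulating.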
I encountered the same problem #15812
Yes, after I changed the heartbeat type to IGNORE, the thread leak was resolved. Please test. @simmonn @peak-xu
Thank you, I'll try this approach.
Do I need to compile and package the dev branch code myself for testing? It may take some time.
Yes, or you can directly update your code.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.