[Bug] [Worker] RemoteShell tasks may have memory leak issue

Open simmonn opened this issue 10 months ago • 12 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

Version: 3.2.0
JVM options: -Xmx3g -Xms3g -Xmn1g
JDK: amazon-corretto-11.0.19.7.1-linux-x86_64

My project has 60 scheduled RemoteShell tasks (executing PHP commands). After running for a while, Full GCs become frequent, causing all tasks to fail and making the worker nodes appear deadlocked. As a workaround, I had to change the RemoteShell tasks to Shell tasks whose command uses ssh -i id_rsa ''.

Apart from some error logs, I also noticed WARN logs with an NPE (NullPointerException) every time a task is executed:

[WARN] 2024-04-10 04:01:27.782 +0800 org.apache.sshd.client.session.ClientSessionImpl:[618] - [WorkflowInstance-0][TaskInstance-0] - exceptionCaught(ClientSessionImpl[root@/172.19.23.121:22])[state=Opened] NullPointerException: No customized heartbeat handler registered

Here is the error log:

[ERROR] 2024-04-10 04:01:01.146 +0800 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[181] - [WorkflowInstance-72475][TaskInstance-74145] - Task execute failed, due to meet an exception
org.apache.dolphinscheduler.plugin.task.api.TaskException: Execute shell task error
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.handle(RemoteShellTask.java:110)
        at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerDelayTaskExecuteRunnable.executeTask(DefaultWorkerDelayTaskExecuteRunnable.java:57)
        at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.run(WorkerTaskExecuteRunnable.java:175)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: Remote shell task error
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.run(RemoteExecutor.java:101)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.handle(RemoteShellTask.java:104)
        ... 9 common frames omitted
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: SSH connection failed
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:83)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.runRemote(RemoteExecutor.java:208)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getTaskPid(RemoteExecutor.java:184)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.run(RemoteExecutor.java:91)
        ... 10 common frames omitted
Caused by: org.apache.sshd.common.SshException: DefaultConnectFuture[root@/172.19.23.121:22]: Failed to get operation result within specified timeout: 5000
        at org.apache.sshd.common.future.AbstractSshFuture.formatExceptionMessage(AbstractSshFuture.java:185)
        at org.apache.sshd.common.future.AbstractSshFuture.verifyResult(AbstractSshFuture.java:111)
        at org.apache.sshd.client.future.DefaultConnectFuture.verify(DefaultConnectFuture.java:42)
        at org.apache.sshd.client.future.DefaultConnectFuture.verify(DefaultConnectFuture.java:34)
        at org.apache.dolphinscheduler.plugin.datasource.ssh.SSHUtils.getSession(SSHUtils.java:42)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:78)
        ... 13 common frames omitted
[INFO] 2024-04-10 04:01:02.874 +0800 org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask:[118] - [WorkflowInstance-72475][TaskInstance-74145] - kill remote task dolphinscheduler-remoteshell-74145
[ERROR] 2024-04-10 04:01:02.875 +0800 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[140] - [WorkflowInstance-72475][TaskInstance-74145] - Cancel task failed, this will not affect the taskInstance status, but you need to check manual
org.apache.dolphinscheduler.plugin.task.api.TaskException: cancel application error
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.cancel(RemoteShellTask.java:121)
        at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.cancelTask(WorkerTaskExecuteRunnable.java:136)
        at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.afterThrowing(WorkerTaskExecuteRunnable.java:118)
        at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerDelayTaskExecuteRunnable.afterThrowing(DefaultWorkerDelayTaskExecuteRunnable.java:67)
        at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.run(WorkerTaskExecuteRunnable.java:182)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: SSH connection failed
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:83)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.runRemote(RemoteExecutor.java:208)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getTaskPid(RemoteExecutor.java:184)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.kill(RemoteExecutor.java:176)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.cancel(RemoteShellTask.java:119)
        ... 11 common frames omitted
Caused by: java.lang.IllegalStateException: SshClient not started. Please call start() method before connecting to a server
        at org.apache.sshd.client.SshClient.doConnect(SshClient.java:627)
        at org.apache.sshd.client.SshClient.doConnect(SshClient.java:616)
        at org.apache.sshd.client.SshClient.connect(SshClient.java:547)
        at org.apache.sshd.client.SshClient.connect(SshClient.java:539)
        at org.apache.sshd.client.session.ClientSessionCreator.connect(ClientSessionCreator.java:74)
        at org.apache.sshd.client.session.ClientSessionCreator.connect(ClientSessionCreator.java:57)
        at org.apache.dolphinscheduler.plugin.datasource.ssh.SSHUtils.getSession(SSHUtils.java:41)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:78)
        ... 15 common frames omitted

Here is a snapshot of the host's memory:

image

What you expected to happen

RemoteShell tasks should execute without memory leaks.

How to reproduce

Create RemoteShell tasks and schedule many of them within a short period.

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

simmonn avatar Apr 15 '24 11:04 simmonn

Should be fixed by https://github.com/apache/dolphinscheduler/pull/15348

ruanwenjun avatar Apr 15 '24 15:04 ruanwenjun

Should be finished by #15348

I had already applied the fix you mentioned locally, and the memory leak still exists. image image image

simmonn avatar Apr 16 '24 01:04 simmonn

Have you closed the RemoteExecutor?

    // add task close method to release resources
    try (RemoteExecutor executor = remoteExecutor) {
        // ... execute the task with executor ...
    }
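For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what try-with-resources guarantees. FakeRemoteExecutor is a hypothetical stand-in that only records lifecycle events; it is not the real RemoteExecutor and opens no SSH connection.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseExecutorSketch {

    // Hypothetical stand-in for RemoteExecutor; it only records lifecycle events.
    static class FakeRemoteExecutor implements AutoCloseable {
        private final List<String> events;

        FakeRemoteExecutor(List<String> events) {
            this.events = events;
            events.add("open");
        }

        void run(String command) {
            events.add("run:" + command);
            if (command.isEmpty()) {
                throw new IllegalArgumentException("empty command");
            }
        }

        @Override
        public void close() {
            // A real executor would release its SSH session and client here.
            events.add("close");
        }
    }

    // Runs one command and returns the recorded events.
    // close() is called whether run() returns normally or throws.
    public static List<String> execute(String command) {
        List<String> events = new ArrayList<>();
        try (FakeRemoteExecutor executor = new FakeRemoteExecutor(events)) {
            executor.run(command);
        } catch (IllegalArgumentException e) {
            // The resource has already been closed when this block runs.
            events.add("failed");
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(execute("php job.php"));
        System.out.println(execute(""));
    }
}
```

In the failing case the recorded order is open, run, close, failed: the resource is closed before the catch block runs. If the worker still leaks with this pattern in place, the leak is probably in something that close() itself does not release.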

ruanwenjun avatar Apr 16 '24 05:04 ruanwenjun

Yes, I have. Here is the code: image

simmonn avatar Apr 16 '24 05:04 simmonn

@simmonn Could you please provide the heap dump file or stack info? Maybe this is caused by thread leak.
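One rough way to check for a thread leak from inside a JVM is to count live threads grouped by name and compare the counts over time. This is only a sketch: the class and method names are made up for illustration, and it inspects the JVM it runs in, so for a running worker the same grouping idea would have to be applied to jstack output instead.

```java
import java.util.Map;
import java.util.TreeMap;

public class ThreadCountSketch {

    // Strips a trailing "-<digits>" (or bare digits) so that pool threads
    // such as "pool-1-thread-2" and "pool-1-thread-3" share one group.
    public static String groupName(String threadName) {
        return threadName.replaceAll("-?\\d+$", "");
    }

    // Counts the live threads of the current JVM, grouped by normalized name.
    public static Map<String, Integer> countLiveThreads() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(groupName(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one line per group: "<count>\t<group name>".
        countLiveThreads().forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

A group whose count keeps growing between two snapshots taken some minutes apart, while the number of running tasks stays flat, is a candidate for the leak.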

ruanwenjun avatar Apr 16 '24 08:04 ruanwenjun

Should be finished by #15348

I had already applied the fix you mentioned locally, and the memory leak still exists. image image image

Please try to change the heartbeatType to IGNORE

session.setSessionHeartbeat(SessionHeartbeatController.HeartbeatType.IGNORE, Duration.ofSeconds(3));

ruanwenjun avatar Apr 16 '24 09:04 ruanwenjun

@simmonn Could you please provide the heap dump file or stack info? Maybe this is caused by thread leak.

I reproduced it in the test environment. Executing the SSH command via a Shell task works fine. However, when executing tasks via RemoteShell, the tasks still get stuck after a while. Here is the stack info: stack_417.jstack.gz

simmonn avatar Apr 17 '24 02:04 simmonn

@simmonn Could you please provide the heap dump file or stack info? Maybe this is caused by thread leak.

I encountered the same problem #15812

peak-xu avatar Apr 17 '24 02:04 peak-xu

Yes, after I changed the heartbeat type to IGNORE, the thread leak was resolved. Please test. @simmonn @peak-xu

ruanwenjun avatar Apr 17 '24 04:04 ruanwenjun

Yes, I changed the heartbeat to IGNORE then the thread leak can be resolved. Please test. @simmonn @peak-xu

Thank you, I'll try this approach.

simmonn avatar Apr 17 '24 05:04 simmonn

Yes, I changed the heartbeat to IGNORE then the thread leak can be resolved. Please test. @simmonn @peak-xu

Do I need to compile and package the dev branch code myself for testing? It may take some time.

peak-xu avatar Apr 17 '24 06:04 peak-xu

Yes, I changed the heartbeat to IGNORE then the thread leak can be resolved. Please test. @simmonn @peak-xu

Do I need to compile and package the dev branch code myself for testing? It may take some time.

Yes, or you can directly update your code.

ruanwenjun avatar Apr 17 '24 10:04 ruanwenjun

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar May 25 '24 00:05 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Jun 01 '24 00:06 github-actions[bot]