dolphinscheduler
[Bug] [Master-server 3.2.1] Host of task instance is null appears in version 3.2.1
Search before asking
- [X] I had searched in the issues and found no similar issues.
What happened
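After upgrading from 3.1.9 to 3.2.1, five or six task instances per day fail to be scheduled. On investigation, the host column in the task_instance table is null for these tasks at scheduling time, as shown in the screenshot:
The log output is as follows: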
[INFO] 2024-02-22 03:00:16.737 -0500 o.a.d.a.s.i.SchedulerServiceImpl:[809] - Schedule update complete, projectCode:12669712879296, processDefinitionCode:12670706423502, scheduleId:2.
[INFO] 2024-02-22 03:00:18.898 -0500 o.a.d.s.q.QuartzScheduler:[104] - Add job, job name: job_2, group name: jobgroup_1
[INFO] 2024-02-22 03:00:18.908 -0500 o.a.d.s.q.QuartzScheduler:[137] - schedule job trigger, triggerName: job_2, triggerGroupName: jobgroup_1, cronExpression: 30 0/1 * * * ? *, startDate: Thu Feb 22 03:00:18 EST 2024, endDate: Fri Jan 30 02:00:00 EST 2224
[ERROR] 2024-02-22 03:00:42.551 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120691.
[ERROR] 2024-02-22 03:01:22.126 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.
[ERROR] 2024-02-22 03:01:53.114 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120691.
[ERROR] 2024-02-22 03:02:22.820 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.
[ERROR] 2024-02-22 03:03:08.621 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.
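To see how often each task instance hits this error, the ERROR lines above can be tallied with a small script. This is only a sketch: the pattern is copied from the log message shown here, and the function name `count_null_host_errors` is made up for illustration.

```python
import re
from collections import Counter

# Pattern copied from the ERROR lines in the log above.
NULL_HOST = re.compile(r"Host of task instance is null, taskInstanceId:(\d+)")


def count_null_host_errors(log_text: str) -> Counter:
    """Count null-host errors per taskInstanceId (hypothetical helper)."""
    return Counter(int(task_id) for task_id in NULL_HOST.findall(log_text))


# The five ERROR entries from the log excerpt in this issue.
sample = "\n".join([
    "[ERROR] 2024-02-22 03:00:42.551 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120691.",
    "[ERROR] 2024-02-22 03:01:22.126 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.",
    "[ERROR] 2024-02-22 03:01:53.114 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120691.",
    "[ERROR] 2024-02-22 03:02:22.820 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.",
    "[ERROR] 2024-02-22 03:03:08.621 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.",
])

counts = count_null_host_errors(sample)
for task_id, hits in sorted(counts.items()):
    print(f"taskInstanceId {task_id}: {hits} null-host errors")
```

In this excerpt only two task instances (120691 and 120693) are involved, and each is logged repeatedly.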
What you expected to happen
There should not be several tasks failing to be scheduled every day.
How to reproduce
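Tasks have been running on 3.2.1 for 3 days, and the problem occurs every day. Affected tasks can be found in the task_instance table in the database: any row with an empty host has hit this problem.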
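The check described above amounts to selecting rows whose host is null or empty. Below is a minimal sketch that uses an in-memory SQLite table as a stand-in for the real metadata database. The three-column schema, the task names, and the host value are assumptions for illustration only; the real table has many more columns and its name may differ in your deployment.

```python
import sqlite3

# In-memory stand-in for the metadata database (assumption: simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance (id INTEGER PRIMARY KEY, name TEXT, host TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance (id, name, host) VALUES (?, ?, ?)",
    [
        (120690, "task_a", "worker-1:1234"),  # made-up host: dispatched normally
        (120691, "task_b", None),             # id from the log: host never set
        (120693, "task_c", None),             # id from the log: host never set
    ],
)

# Rows with a NULL or empty host are the ones affected by this issue.
null_host_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM task_instance WHERE host IS NULL OR host = '' ORDER BY id"
    )
]
print(null_host_ids)
conn.close()
```

The same WHERE clause should carry over to MySQL or PostgreSQL, but verify the table and column names against your own schema first.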
Anything else
No response
Version
3.2.x
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
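+1. I hit this after upgrading from 3.1.9 to 3.2.0, and again after upgrading to 3.2.1, and in my case it happens even more often. Could someone help look into why there are so many "task instance host is null" errors?
After two days of testing, I found what caused the null task instance host in my case. I deploy on a cluster of physical machines. Running timedatectl showed that the machines' time zone was New York. I changed it to Asia/Shanghai (CST, +0800) and restarted the services, and the problem was resolved, as shown in the screenshot: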
@zhongjiajie
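As shown in the screenshot above, the time zone environment variable in dolphinscheduler_env.sh is set to GMT+8, which is fine. However, because of the server's own time zone setting, times were 13 hours behind.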
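The 13-hour figure matches the gap between the two zones on the date of the log. Here is a quick check, assuming that "New York" corresponds to the IANA zone America/New_York (the log lines show -0500 and EST, which is consistent with that):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Timestamp of the first ERROR line in the log, in the server's zone
# (assumption: the server zone is America/New_York).
instant = datetime(2024, 2, 22, 3, 0, 42, tzinfo=ZoneInfo("America/New_York"))

# The same instant as seen in the zone the scheduler was configured for.
shanghai = instant.astimezone(ZoneInfo("Asia/Shanghai"))

# Difference between the two UTC offsets on that date.
gap_hours = (shanghai.utcoffset() - instant.utcoffset()).total_seconds() / 3600

print(f"New York local time: {instant:%Y-%m-%d %H:%M:%S %z}")
print(f"Shanghai local time: {shanghai:%Y-%m-%d %H:%M:%S %z}")
print(f"Offset gap in hours: {gap_hours}")
```

In February, New York is on standard time (UTC-5), so the gap to UTC+8 is 13 hours. During daylight saving time it would be 12.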
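Update: after repeated attempts and tracing, I found that a burst of tasks drove memory usage too high and caused the DolphinScheduler services to malfunction, so that subsequent failover recovery and reruns could not execute normally. The "task instance host is null" error then appeared repeatedly, which led to a large number of task failures.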
3.2.0 has the same problem
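Later the service was restarted once, and the problem appeared again...
We hit this on 3.2.1 as well. So far it looks like it may be because a full GC reclaimed the workers stored in a map, or because the configured worker resources are insufficient.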
I ran into the same problem. How did you solve it in the end? By adding worker machines?
Yes. We increased the master's maximum heap size to reduce full GCs and added worker nodes, and that solved it.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.