
[Bug] [Master-server 3.2.1] Host of task instance is null appears in version 3.2.1

Open liubo988 opened this issue 1 year ago • 12 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

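After upgrading from 3.1.9 to 3.2.1, 5 or 6 task instances fail to be scheduled each day. Investigation showed that when these tasks are scheduled, the host column in the task_instance table is null, as shown below: bug12 bug13 bug14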

bug11

The logs printed are as follows:

[INFO] 2024-02-22 03:00:16.737 -0500 o.a.d.a.s.i.SchedulerServiceImpl:[809] - Schedule update complete, projectCode:12669712879296, processDefinitionCode:12670706423502, scheduleId:2.
[INFO] 2024-02-22 03:00:18.898 -0500 o.a.d.s.q.QuartzScheduler:[104] - Add job, job name: job_2, group name: jobgroup_1
[INFO] 2024-02-22 03:00:18.908 -0500 o.a.d.s.q.QuartzScheduler:[137] - schedule job trigger, triggerName: job_2, triggerGroupName: jobgroup_1, cronExpression: 30 0/1 * * * ? *, startDate: Thu Feb 22 03:00:18 EST 2024, endDate: Fri Jan 30 02:00:00 EST 2224
[ERROR] 2024-02-22 03:00:42.551 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120691.
[ERROR] 2024-02-22 03:01:22.126 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.
[ERROR] 2024-02-22 03:01:53.114 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120691.
[ERROR] 2024-02-22 03:02:22.820 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.
[ERROR] 2024-02-22 03:03:08.621 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.
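To see how many distinct task instances are behind the repeated errors, the log can be tallied per taskInstanceId. The following is a minimal sketch, not DolphinScheduler code: the class and method names (`HostNullLogCounter`, `countByTaskInstance`) are hypothetical, and the pattern assumes the message text stays exactly as it appears in the excerpt above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for triaging the log excerpt; not part of DolphinScheduler.
class HostNullLogCounter {
    // Assumes the error message format shown in the issue's log excerpt.
    private static final Pattern HOST_NULL =
            Pattern.compile("Host of task instance is null, taskInstanceId:(\\d+)");

    /** Counts "Host of task instance is null" errors per task instance ID, in order of first appearance. */
    static Map<Long, Integer> countByTaskInstance(String logText) {
        Map<Long, Integer> counts = new LinkedHashMap<>();
        Matcher m = HOST_NULL.matcher(logText);
        while (m.find()) {
            counts.merge(Long.parseLong(m.group(1)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The five ERROR lines from the excerpt, shortened to the relevant part.
        String sample = String.join("\n",
                "[ERROR] 2024-02-22 03:00:42.551 -0500 - Host of task instance is null, taskInstanceId:120691.",
                "[ERROR] 2024-02-22 03:01:22.126 -0500 - Host of task instance is null, taskInstanceId:120693.",
                "[ERROR] 2024-02-22 03:01:53.114 -0500 - Host of task instance is null, taskInstanceId:120691.",
                "[ERROR] 2024-02-22 03:02:22.820 -0500 - Host of task instance is null, taskInstanceId:120693.",
                "[ERROR] 2024-02-22 03:03:08.621 -0500 - Host of task instance is null, taskInstanceId:120693.");
        System.out.println(countByTaskInstance(sample)); // prints {120691=2, 120693=3}
    }
}
```

For this excerpt, the five ERROR lines come from only two task instances, 120691 and 120693.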

What you expected to happen

We do not expect several tasks to fail scheduling every day.

How to reproduce

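I ran tasks on version 3.2.1 for 3 days, and the problem occurred every day. In the database, any row in the task_instance table with an empty host is an instance of this problem.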

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

liubo988 avatar Feb 22 '24 08:02 liubo988

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

After upgrading from version 3.1.9 to 3.2.1, 5 or 6 task instances fail to be scheduled each day. Investigation showed that when these tasks were scheduled, the host column of the task_instance table was null, as shown below: bug12 bug13 bug14

bug11

The print log is as follows:

[INFO] 2024-02-22 03:00:16.737 -0500 o.a.d.a.s.i.SchedulerServiceImpl:[809] - Schedule update complete, projectCode:12669712879296, processDefinitionCode:12670706423502, scheduleId:2.
[INFO] 2024-02-22 03:00:18.898 -0500 o.a.d.s.q.QuartzScheduler:[104] - Add job, job name: job_2, group name: jobgroup_1
[INFO] 2024-02-22 03:00:18.908 -0500 o.a.d.s.q.QuartzScheduler:[137] - schedule job trigger, triggerName: job_2, triggerGroupName: jobgroup_1, cronExpression: 30 0/1 * * * ? *, startDate: Thu Feb 22 03:00:18 EST 2024, endDate: Fri Jan 30 02:00:00 EST 2224
[ERROR] 2024-02-22 03:00:42.551 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120691.
[ERROR] 2024-02-22 03:01:22.126 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.
[ERROR] 2024-02-22 03:01:53.114 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120691.
[ERROR] 2024-02-22 03:02:22.820 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.
[ERROR] 2024-02-22 03:03:08.621 -0500 o.a.d.a.s.i.LoggerServiceImpl:[96] - Host of task instance is null, taskInstanceId:120693.

What you expected to happen

I don’t want to have several failed scheduling tasks every day.

How to reproduce

Tasks were run on version 3.2.1 for 3 days, and the problem occurred every day. In the database, any row in the task_instance table with an empty host is an instance of this problem.

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

github-actions[bot] avatar Feb 22 '24 08:02 github-actions[bot]

Uploading bug14.png…

liubo988 avatar Feb 22 '24 08:02 liubo988

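+1. The same thing happened after upgrading from 3.1.9 to 3.2.0, and again after upgrading to 3.2.1, and in my case it occurs even more often. I hope someone can help look into why so many task instances have a null host!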

z0L1n avatar Feb 23 '24 06:02 z0L1n

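After two days of testing, I found what causes the task instance host to be null. I use a cluster deployment on physical machines. Running timedatectl showed that the machines' time zone was New York. I then changed the time zone to Asia/Shanghai (CST, +0800) and restarted the services, and the problem was resolved, as shown in the screenshot: 企业微信截图_17096115034609 @zhongjiajie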

liubo988 avatar Mar 06 '24 01:03 liubo988

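企业微信截图_17096900323858 As shown in the screenshot above, the time zone environment variable in dolphinscheduler_env.sh is set to GMT+8, which is fine, but because of the server's own time zone, the time was 13 hours behind.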
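The 13-hour figure matches the offset difference between the two zones on the dates in this thread. The sketch below checks that arithmetic with `java.time`. It assumes the server's "New York" setting corresponds to the IANA zone America/New_York and takes the instant from the first ERROR line of the log; the class and method names (`ZoneGap`, `gap`) are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical helper; only illustrates the offset arithmetic.
class ZoneGap {
    /** Returns how far zone b's wall clock is ahead of zone a's at the given instant. */
    static Duration gap(ZoneId a, ZoneId b, Instant at) {
        ZoneOffset offsetA = a.getRules().getOffset(at);
        ZoneOffset offsetB = b.getRules().getOffset(at);
        return Duration.ofSeconds(offsetB.getTotalSeconds() - offsetA.getTotalSeconds());
    }

    public static void main(String[] args) {
        // Timestamp of the first ERROR line in the issue's log (-0500, i.e. EST).
        Instant at = OffsetDateTime.parse("2024-02-22T03:00:42.551-05:00").toInstant();
        Duration d = gap(ZoneId.of("America/New_York"), ZoneId.of("Asia/Shanghai"), at);
        System.out.println(d.toHours()); // prints 13
        // The zone the JVM itself resolved, useful for comparing against the configured one.
        System.out.println(ZoneId.systemDefault());
    }
}
```

The gap is 13 hours only while New York is on standard time (UTC-5); during daylight saving time it would be 12.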

liubo988 avatar Mar 06 '24 02:03 liubo988

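+1. The same thing happened after upgrading from 3.1.9 to 3.2.0, and again after upgrading to 3.2.1, and in my case it occurs even more often. I hope someone can help look into why so many task instances have a null host!

Update: after repeated attempts and tracing, I found that sudden spikes in memory usage from tasks caused the DS services to malfunction, so that subsequent fault-tolerant recovery and reruns could not execute normally. "task instance host is null" then appeared repeatedly, which led to too many task failures.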

z0L1n avatar Mar 06 '24 02:03 z0L1n

3.2.0 has the same problem

q4q5q6qw avatar Mar 19 '24 06:03 q4q5q6qw

Later the services were restarted once, and the problem appeared again...

liubo988 avatar Mar 25 '24 06:03 liubo988

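We also ran into this problem on 3.2.1. For now it looks like the cause may be that a full GC reclaimed the workers stored in the map, or that the configured worker resources are insufficient.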

sean1205 avatar Mar 29 '24 05:03 sean1205

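We also ran into this problem on 3.2.1. For now it looks like the cause may be that a full GC reclaimed the workers stored in the map, or that the configured worker resources are insufficient.

I ran into the same problem. How did you resolve it in the end? By adding worker machines?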

ahululu avatar Apr 10 '24 06:04 ahululu

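We also ran into this problem on 3.2.1. For now it looks like the cause may be that a full GC reclaimed the workers stored in the map, or that the configured worker resources are insufficient.

I ran into the same problem. How did you resolve it in the end? By adding worker machines?

Yes. We increased the master's maximum heap size to reduce full GCs and added worker nodes, and that resolved it.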

sean1205 avatar Apr 17 '24 09:04 sean1205

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar May 18 '24 00:05 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar May 27 '24 00:05 github-actions[bot]