
Engine should be robust when create worker returns error in executor runtime

Open amyangfei opened this issue 2 years ago • 1 comments

Is your feature request related to a problem?

The problem can be reproduced as follows:

  • Set up a dataflow engine cluster with 1 master and 3 executor nodes.
  • Create a fake job with an empty job-config.

Code base: master@https://github.com/pingcap/tiflow/commit/72c93c2a726f954af02771153df30cc5223dcebf

The executor runtime then fails to create the worker, and the job manager keeps re-dispatching the task indefinitely:

  • worker failure
[2022/08/08 09:41:31.125 +00:00] [ERROR] [server.go:199] ["Failed to create worker"] [error="unexpected end of JSON input"] [errorVerbose="unexpected end of JSON input\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/engine/framework/registry.(*SimpleWorkerFactory[...]).DeserializeConfig\n\tgithub.com/pingcap/tiflow/engine/framework/registry/factory.go:84\ngithub.com/pingcap/tiflow/engine/framework/registry.(*registryImpl).CreateWorker\n\tgithub.com/pingcap/tiflow/engine/framework/registry/registry.go:92\ngithub.com/pingcap/tiflow/engine/executor.(*Server).makeTask\n\tgithub.com/pingcap/tiflow/engine/executor/server.go:192\ngithub.com/pingcap/tiflow/engine/executor.(*Server).PreDispatchTask\n\tgithub.com/pingcap/tiflow/engine/executor/server.go:216\ngithub.com/pingcap/tiflow/engine/enginepb._Executor_PreDispatchTask_Handler\n\tgithub.com/pingcap/tiflow/engine/enginepb/executor.pb.go:437\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\tgoogle.golang.org/[email protected]/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/[email protected]/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\tgoogle.golang.org/[email protected]/server.go:922\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
  • worker recreate
[2022/08/08 09:51:34.030 +00:00] [INFO] [master.go:577] [CreateWorker] [job_id=dataflow-engine-job-manager] [worker-type=3] [worker-config="{\"seq-id\":7,\"created-at\":\"2022-08-08T09:51:26.183Z\",\"updated-at\":\"2022-08-08T09:51:26.183Z\",\"project-id\":\"20001\",\"id\":\"2562bdc8-c418-4e01-96a8-8b486fb00225\",\"type\":3,\"status\":1,\"node-id\":\"\",\"addr\":\"\",\"epoch\":0,\"config\":\"\",\"Deleted\":null}"] [cost=1] [resources="[]"] [master-id=dataflow-engine-job-manager]
[2022/08/08 09:51:34.030 +00:00] [INFO] [job_fsm.go:209] ["job master recovered"] [job="{\"seq-id\":7,\"created-at\":\"2022-08-08T09:51:26.183Z\",\"updated-at\":\"2022-08-08T09:51:26.183Z\",\"project-id\":\"20001\",\"id\":\"2562bdc8-c418-4e01-96a8-8b486fb00225\",\"type\":3,\"status\":1,\"node-id\":\"\",\"addr\":\"\",\"epoch\":0,\"config\":\"\",\"Deleted\":null}"]
[2022/08/08 09:51:34.031 +00:00] [INFO] [server.go:199] [payload="task_id:\"2562bdc8-c418-4e01-96a8-8b486fb00225\" cost:1 "] [request=ScheduleTask]
[2022/08/08 09:51:34.034 +00:00] [INFO] [master.go:646] ["DispatchTask failed"] [job_id=dataflow-engine-job-manager] [error="rpc error: code = Aborted desc = unexpected end of JSON input"] [errorVerbose="rpc error: code = Aborted desc = unexpected end of JSON input\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/engine/pkg/rpcerror.FromGRPCError\n\tgithub.com/pingcap/tiflow/engine/pkg/rpcerror/grpc.go:63\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).callOnce\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:88\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).Do.func1\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:74\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:57\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).Do\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:72\ngithub.com/pingcap/tiflow/engine/pkg/client.(*executorServiceClient).DispatchTask\n\tgithub.com/pingcap/tiflow/engine/pkg/client/executor_service_client.go:94\ngithub.com/pingcap/tiflow/engine/framework.(*DefaultBaseMaster).CreateWorker.func1\n\tgithub.com/pingcap/tiflow/engine/framework/master.go:638\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/08/08 09:51:34.830 +00:00] [WARN] [jobmanager.go:465] ["dispatch worker met error"] [error="rpc error: code = Aborted desc = unexpected end of JSON input"] [errorVerbose="rpc error: code = Aborted desc = unexpected end of JSON input\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/engine/pkg/rpcerror.FromGRPCError\n\tgithub.com/pingcap/tiflow/engine/pkg/rpcerror/grpc.go:63\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).callOnce\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:88\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).Do.func1\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:74\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:57\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).Do\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:72\ngithub.com/pingcap/tiflow/engine/pkg/client.(*executorServiceClient).DispatchTask\n\tgithub.com/pingcap/tiflow/engine/pkg/client/executor_service_client.go:94\ngithub.com/pingcap/tiflow/engine/framework.(*DefaultBaseMaster).CreateWorker.func1\n\tgithub.com/pingcap/tiflow/engine/framework/master.go:638\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/08/08 09:51:34.830 +00:00] [INFO] [master.go:577] [CreateWorker] [job_id=dataflow-engine-job-manager] [worker-type=3] [worker-config="{\"seq-id\":7,\"created-at\":\"2022-08-08T09:51:26.183Z\",\"updated-at\":\"2022-08-08T09:51:26.183Z\",\"project-id\":\"20001\",\"id\":\"2562bdc8-c418-4e01-96a8-8b486fb00225\",\"type\":3,\"status\":1,\"node-id\":\"\",\"addr\":\"\",\"epoch\":0,\"config\":\"\",\"Deleted\":null}"] [cost=1] [resources="[]"] [master-id=dataflow-engine-job-manager]
[2022/08/08 09:51:34.830 +00:00] [INFO] [job_fsm.go:209] ["job master recovered"] [job="{\"seq-id\":7,\"created-at\":\"2022-08-08T09:51:26.183Z\",\"updated-at\":\"2022-08-08T09:51:26.183Z\",\"project-id\":\"20001\",\"id\":\"2562bdc8-c418-4e01-96a8-8b486fb00225\",\"type\":3,\"status\":1,\"node-id\":\"\",\"addr\":\"\",\"epoch\":0,\"config\":\"\",\"Deleted\":null}"]
[2022/08/08 09:51:34.831 +00:00] [INFO] [server.go:199] [payload="task_id:\"2562bdc8-c418-4e01-96a8-8b486fb00225\" cost:1 "] [request=ScheduleTask]
[2022/08/08 09:51:34.833 +00:00] [INFO] [master.go:646] ["DispatchTask failed"] [job_id=dataflow-engine-job-manager] [error="rpc error: code = Aborted desc = unexpected end of JSON input"] [errorVerbose="rpc error: code = Aborted desc = unexpected end of JSON input\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/engine/pkg/rpcerror.FromGRPCError\n\tgithub.com/pingcap/tiflow/engine/pkg/rpcerror/grpc.go:63\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).callOnce\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:88\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).Do.func1\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:74\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:57\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/engine/pkg/client/internal.(*Call[...]).Do\n\tgithub.com/pingcap/tiflow/engine/pkg/client/internal/call.go:72\ngithub.com/pingcap/tiflow/engine/pkg/client.(*executorServiceClient).DispatchTask\n\tgithub.com/pingcap/tiflow/engine/pkg/client/executor_service_client.go:94\ngithub.com/pingcap/tiflow/engine/framework.(*DefaultBaseMaster).CreateWorker.func1\n\tgithub.com/pingcap/tiflow/engine/framework/master.go:638\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
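
The `unexpected end of JSON input` message in these logs is the error that Go's `encoding/json` returns when asked to unmarshal an empty byte slice, which is consistent with the empty `config` field in the job record. The following is a minimal sketch of that behavior only. It assumes the worker factory deserializes the config with `encoding/json`; `fakeConfig` and `deserializeConfig` are hypothetical names used for illustration, not tiflow's actual types or functions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fakeConfig is a hypothetical stand-in for a worker config type.
type fakeConfig struct {
	Name string `json:"name"`
}

// deserializeConfig is a hypothetical helper that mimics what a worker
// factory might do with the raw config bytes it receives.
func deserializeConfig(raw []byte) (*fakeConfig, error) {
	cfg := &fakeConfig{}
	if err := json.Unmarshal(raw, cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	// An empty config, as in the job record above ("config":"").
	_, err := deserializeConfig([]byte(""))
	fmt.Println(err) // unexpected end of JSON input
}
```

Because the input never changes between attempts, every retry hits the same error, so retrying cannot succeed on its own.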

Describe the feature you'd like

Several improvements are possible:

  • [ ] Check the job config before the task is dispatched.
  • [ ] Restrict the frequency of task re-dispatching, or mark the task as failed if it fails too frequently.
  • [ ] It should be possible to cancel a job if it keeps failing.

Describe alternatives you've considered

No response

Teachability, Documentation, Adoption, Migration Strategy

No response

amyangfei avatar Aug 08 '22 09:08 amyangfei

/assign

CharlesCheung96 avatar Aug 08 '22 10:08 CharlesCheung96

Will be solved in https://github.com/pingcap/tiflow/issues/6749

amyangfei avatar Aug 30 '22 02:08 amyangfei

Closed by #6749

amyangfei avatar Mar 02 '23 06:03 amyangfei