
Add an instance id for the Data Flow server, to be associated with task executions

Open · sergeibulavintsev opened this issue 5 years ago · 7 comments

It would be nice to add an instance id for the Data Flow server that uniquely identifies it. It happens that the Data Flow server is restarted or crashes and a task execution has a start time but no end time, so after a restart it is impossible to tell whether the task is still running or whether it belonged to the previously running server and has to be cleaned up manually. I think an instance id saved along with the task execution can help figure out whether the task is still running or crashed with the server and should be restarted or cleaned up. It would also be nice to return the current Data Flow instance id from the REST API in the /about endpoint.
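To illustrate the idea, here is a minimal sketch in Java (nothing below exists in SCDF today; the class and method names are hypothetical). The server would generate a fresh id on every start, so any task execution stamped with a different id is known to predate the current instance:

```java
// Hypothetical sketch, not existing SCDF code: a per-start server instance id.
import java.util.UUID;

public class DataflowInstanceId {

    // A new id is generated on every server start, so a task execution
    // stamped with a different id must belong to a previous server run.
    private final String id = UUID.randomUUID().toString();

    public String getId() {
        return id;
    }
}
```

The id could then be written into each task execution record at launch time and returned from the /about endpoint.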

sergeibulavintsev · Oct 30 '20 15:10

Hi, @sergeibulavintsev. It sounds like useful information from a tracing perspective. If you have cycles, please feel free to take a stab at it through a pull request.

That said, I wonder if there are already other ways to surface this correlation. @ilayaperumalg / @jvalkeal: Any ideas?

sabbyanandan · Nov 05 '20 01:11

Might not be a bad idea at all. We should probably do a wider PoC for generic observability/tracing features. I'd really appreciate being able to see and trace a problem more easily when things go south in a complex environment (sooner or later something always goes wrong).

jvalkeal · Dec 11 '20 12:12

In 2.8.x, we now support tracing at the level of SCDF, Skipper, and apps (including tasks). @jvalkeal @tzolov: Is this sufficient for now?

sabbyanandan · May 24 '21 17:05

@jvalkeal @tzolov: thoughts?

sabbyanandan · Oct 19 '21 21:10

The current setup has a 1-to-1 relationship between the Data Flow server and the database tables; that is, the one instance of SCDF that points to the database 'owns' it. There is no owner other than the SCDF server that connects to the database, so if the ID was stored, it would always refer to that single server.

It happens that the Data Flow server is restarted or crashes and a task execution has a start time but no end time, so after a restart it is impossible to tell whether the task is still running or whether it belonged to the previously running server and has to be cleaned up manually. I think an instance id saved along with the task execution can help figure out whether the task is still running or crashed with the server and should be restarted or cleaned up.

In the case where there is a start time and no end time in the database, either the task is still running or it crashed. If the task is still running, the information in the task table can be used to find the status of that task on the cluster that ran it. If the task 'crashed hard' (segfault, etc.), then SCDF should put that task in a state indicating it is 'unknown' and offer the possibility to change the exit status manually through an API/UI.
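For illustration, a rough JDBC sketch of how such executions could be listed (the table and column names follow the Spring Cloud Task schema, but verify them against your SCDF version; the connection details are placeholders):

```java
// Rough sketch: list task executions that started but never recorded an end
// time. Each row is either still running or crashed; EXTERNAL_EXECUTION_ID
// can be checked against the platform (e.g. Kubernetes) to tell which.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StaleTaskFinder {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/dataflow"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT TASK_EXECUTION_ID, TASK_NAME, START_TIME, EXTERNAL_EXECUTION_ID "
               + "FROM TASK_EXECUTION "
               + "WHERE START_TIME IS NOT NULL AND END_TIME IS NULL")) {
            while (rs.next()) {
                System.out.printf("%d %s %s %s%n",
                        rs.getLong("TASK_EXECUTION_ID"),
                        rs.getString("TASK_NAME"),
                        rs.getTimestamp("START_TIME"),
                        rs.getString("EXTERNAL_EXECUTION_ID"));
            }
        }
    }
}
```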

@cppwfs @corneil - thoughts?

markpollack · Sep 20 '22 14:09

The relationship between the Data Flow (DF) server and execution environment(s) needs a rethink. The ideal would be many execution environments of different types, where some are considered mirrors of others and some are unique. The execution of apps within an environment could be done by DF or by an 'agent', which is what Skipper does to some extent at the moment. The reality is that for k8s, CF, and (in the future) TAP, you won't need an agent to start the execution of any app/task/job. Only the local deployment requires code that is currently split between DF and Skipper; it should not be recommended for production use and shouldn't need multiple environments. We can do some work to ensure Data Flow is well-behaved when multiple instances are running. Scheduling and execution shouldn't rely on DF being active.

corneil · Sep 20 '22 15:09

When a task has a start time and no end time, that can mean several things:

  1. Data Flow attempted to launch the task; the platform sent a positive ack of the launch but actually failed to launch the app.
  2. The app was started but failed for another reason (DB not present, a bug in the code, etc.).
  3. The application had an override that made it write its task data to another DB.
  4. Data Flow failed in the middle of launching the task, causing the problem (pretty remote).
  5. The task is still running.

But Data Flow failing while the task is running should not prevent the task from updating the database.

We could have a timeout set up in Data Flow that states: if a task stays in condition X for period of time Y, set the state of the task to Unknown. This could be a task definition parameter; a rough sketch of what that could look like follows.
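A minimal sketch of that idea, assuming a hypothetical store abstraction (none of these types exist in Data Flow today):

```java
// Hypothetical sketch of the proposed timeout: periodically mark executions
// that have had no end time for longer than `timeout` as UNKNOWN.
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class StaleTaskReaper {

    interface TaskExecutionStore {              // assumed abstraction
        List<Execution> findRunning();          // START_TIME set, END_TIME null
        void markUnknown(long executionId);
    }

    record Execution(long id, Instant startTime) {}

    private final TaskExecutionStore store;
    private final Duration timeout;             // the "Y period of time"

    StaleTaskReaper(TaskExecutionStore store, Duration timeout) {
        this.store = store;
        this.timeout = timeout;
    }

    // Invoke periodically, e.g. from a Spring @Scheduled method.
    void reap() {
        Instant cutoff = Instant.now().minus(timeout);
        for (Execution e : store.findRunning()) {
            if (e.startTime().isBefore(cutoff)) {
                store.markUnknown(e.id());      // manual override stays possible via API/UI
            }
        }
    }
}
```

A per-task-definition value for the timeout would let long-running tasks opt out of being marked Unknown too early.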

cppwfs · Sep 20 '22 16:09