seldon-core

fix(dataflow): do not create new KafkaStreams app for existing pipelines

Open lc525 opened this issue 1 year ago • 0 comments

This fixes a dataflow-engine bug triggered when the scheduler sends or re-sends pipeline creation messages after a restart. The scheduler does this because it is not aware of the pipelines' status across the various components.

Previous behaviour: On receiving a command to create a pipeline, dataflow-engine would first create a new Kafka Streams application for that pipeline, and only then check whether one already existed and was running.

Because of this, a control-plane restart of the scheduler would cause dataflow errors for pipelines that kept internal state (mostly pipelines using triggers or joins). Kafka Streams would complain about an existing application using the same state directory and fail the newly created pipeline, and dataflow-engine would report that failure to the scheduler.

However, the old pipeline, if it had been running without problems, would continue to do so inside dataflow-engine. The result was a mismatch between the actual state of dataflow-engine and what the scheduler believed that state to be.
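The failure mode can be reproduced in a toy model. The sketch below is not dataflow-engine code: `StateDirLocked`, `OldEngine`, the status strings and the state path are hypothetical names, and a plain set stands in for the state-directory locking that Kafka Streams performs.

```python
class StateDirLocked(Exception):
    """Hypothetical stand-in for the error raised when a state directory is in use."""


class OldEngine:
    def __init__(self):
        self.locked_dirs = set()  # state directories held by running apps
        self.running = {}         # pipeline id -> state directory

    def _start_app(self, pipeline_id):
        state_dir = f"/tmp/state/{pipeline_id}"  # illustrative path only
        if state_dir in self.locked_dirs:
            raise StateDirLocked(state_dir)
        self.locked_dirs.add(state_dir)
        return state_dir

    def create_pipeline(self, pipeline_id):
        """Old ordering: build a new app first, without looking for an existing one."""
        try:
            self.running[pipeline_id] = self._start_app(pipeline_id)
            return "created"
        except StateDirLocked:
            return "failed"  # this status is what the scheduler is told


engine = OldEngine()
first = engine.create_pipeline("join-pipeline")
# The scheduler restarts and re-sends the same create command.
second = engine.create_pipeline("join-pipeline")
still_running = "join-pipeline" in engine.running
```

After the second call the scheduler has been told the pipeline failed (`second`), while `still_running` shows the original application is untouched. That is the mismatch described above.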

New behaviour: Dataflow-engine now first checks whether a pipeline with the same id is already running. If that pipeline is in a healthy state, dataflow-engine simply informs the scheduler that the pipeline is created and takes no further action.

If a pipeline with the same id already exists but is in a failed state, it is first stopped and its local Kafka Streams state is cleaned. Dataflow-engine then attempts to re-create it and sends the resulting status to the scheduler.
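The new ordering amounts to making the create command idempotent. The following is a minimal sketch under stated assumptions, not the code in this PR: `State`, `ToyPipeline`, `IdempotentEngine` and the returned status strings are hypothetical, and the toy pipeline always starts successfully.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    FAILED = "failed"


class ToyPipeline:
    def __init__(self, pipeline_id):
        self.pipeline_id = pipeline_id
        self.state = State.RUNNING  # assumption: a fresh toy pipeline starts fine
        self.local_state_cleaned = False

    def stop(self):
        # Stopping a failed pipeline also cleans its local state.
        self.local_state_cleaned = True


class IdempotentEngine:
    def __init__(self):
        self.pipelines = {}

    def create_pipeline(self, pipeline_id):
        existing = self.pipelines.get(pipeline_id)
        if existing is not None:
            if existing.state is State.RUNNING:
                return "created"  # already healthy: report success, do nothing
            existing.stop()       # failed: stop and clean before re-creating
            del self.pipelines[pipeline_id]
        pipeline = ToyPipeline(pipeline_id)
        self.pipelines[pipeline_id] = pipeline
        return "created" if pipeline.state is State.RUNNING else "failed"


engine = IdempotentEngine()
engine.create_pipeline("p1")
original = engine.pipelines["p1"]

# Re-sent create command for a healthy pipeline: same instance is kept.
status_resend = engine.create_pipeline("p1")
same_instance = engine.pipelines["p1"] is original

# Re-sent create command for a failed pipeline: it is stopped and replaced.
original.state = State.FAILED
status_recreate = engine.create_pipeline("p1")
replaced = engine.pipelines["p1"] is not original
```

Checking for the existing pipeline before building anything means a re-sent command never competes with a healthy application for its state directory.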

Which issue(s) this PR fixes:

  • INFRA-978 (internal issue) Pipelines fail to start because Kafka Streams state hasn't been cleared

Special notes for your reviewer:

  • Tested by restarting the scheduler while a stateful pipeline (one using a join) was running on dataflow-engine
  • Ran pipeline smoke-tests to confirm the change hasn't introduced unexpected behaviour

lc525 • Apr 25 '24 12:04