support for OpenLineage's RUNNING eventType

Open mobuchowski opened this issue 3 years ago • 2 comments

OpenLineage introduces a RUNNING event type, which models a continuous streaming job that is currently running - to differentiate it from the generic OTHER event type. Related issues are https://github.com/OpenLineage/OpenLineage/issues/946 and the discussion here: https://github.com/OpenLineage/OpenLineage/issues/599

Are there any possible problems within Marquez with receiving this type of event? I know LineageEvent has a String eventType - but there could be something else dependent on the existing event types.
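
For reference, a minimal sketch of what such an event might look like on the wire. The top-level fields (eventType, eventTime, run, job, inputs, outputs, producer) follow the OpenLineage run event shape; the namespace, job name, and producer URL are placeholders invented for illustration, and schemaURL is left out because the exact spec version is not stated in this thread.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, run_id, namespace, name):
    """Build a bare-bones OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. START, RUNNING, COMPLETE, OTHER
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": name},
        "inputs": [],
        "outputs": [],
        # Placeholder producer URI, not a real integration.
        "producer": "https://example.com/hypothetical-producer",
    }


event = make_run_event(
    "RUNNING",
    str(uuid.uuid4()),
    "example-namespace",       # hypothetical namespace
    "example-streaming-job",   # hypothetical job name
)
print(json.dumps(event, indent=2))
```

Since eventType is a plain string here, a consumer that only stores the value would accept RUNNING unchanged; the open question is whether any downstream logic branches on specific values.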

mobuchowski avatar Aug 02 '22 15:08 mobuchowski

Related PRs: https://github.com/OpenLineage/OpenLineage/pull/972 https://github.com/OpenLineage/OpenLineage/pull/985

mzareba382 avatar Aug 03 '22 15:08 mzareba382

While Marquez will support an event of type RUNNING, when considering this in the context of a streaming job, we may need to consider the impact of this event on job versions and dataset versions. Currently, Marquez sets the current version of a job and a dataset only when receiving a COMPLETE event. Dataset versions are created before then, but the dataset record itself isn't updated until COMPLETE. Job versions aren't created at all until a COMPLETE event is received. Most importantly, lineage only considers the current_version_uuid column of the jobs table. This means that a streaming job won't show any lineage at all until the job terminates with a COMPLETE event. We can update the logic here, but we need to know it's a streaming job. Perhaps a facet to report that it's a streaming job, not a batch job?
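
To make the behavior described above concrete, here is a simplified sketch, not Marquez's actual code. It models a job whose current_version_uuid is set only on COMPLETE, and an optional streaming-aware mode that also sets it on RUNNING when the event carries a facet marking the job as streaming. The facet name (jobType), its processingType field, and all function names are assumptions made for illustration; the thread does not specify them.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    # Lineage only considers this column; None means no lineage is shown.
    current_version_uuid: Optional[str] = None


def is_streaming(event: dict) -> bool:
    # Hypothetical facet: the name and shape are assumptions, not a confirmed API.
    facet = event.get("job", {}).get("facets", {}).get("jobType", {})
    return facet.get("processingType") == "STREAMING"


def apply_event(job: Job, event: dict, streaming_aware: bool = False) -> Job:
    """Update the job's current version according to the event type."""
    event_type = event["eventType"]
    sets_version = event_type == "COMPLETE" or (
        streaming_aware and event_type == "RUNNING" and is_streaming(event)
    )
    if sets_version:
        # Stand-in for a real version computation: derived from the job name only.
        job.current_version_uuid = str(uuid.uuid5(uuid.NAMESPACE_URL, job.name))
    return job


def has_lineage(job: Job) -> bool:
    return job.current_version_uuid is not None


running = {
    "eventType": "RUNNING",
    "job": {
        "namespace": "example-namespace",
        "name": "example-streaming-job",
        "facets": {"jobType": {"processingType": "STREAMING"}},
    },
}

job = Job("example-streaming-job")
apply_event(job, running)                        # current behavior: RUNNING is ignored
print(has_lineage(job))
apply_event(job, running, streaming_aware=True)  # proposed behavior
print(has_lineage(job))
```

With the default mode the streaming job stays invisible to lineage until a COMPLETE event arrives; with the streaming-aware mode the first RUNNING event is enough. A batch job that emits RUNNING without the facet is unaffected in either mode.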

collado-mike avatar Aug 03 '22 22:08 collado-mike