
Spark Streaming Support

Open lucienfregosi opened this issue 7 years ago • 12 comments

Is it planned to add integration with Spark Streaming? It would be useful to be able to apply lineage tracking to both batch and streaming data.

lucienfregosi avatar Feb 14 '18 10:02 lucienfregosi

Hi Lucien, we are currently enhancing Spline to also support Structured Streaming. This feature will come with Spline version 0.3.

Regards, Marek Novotny


mn-mikke avatar Feb 14 '18 10:02 mn-mikke

Perfect :)

I'm writing a blog post about Spline after testing it (in French first, maybe in English later), so I will be able to include this information in my post.

lucienfregosi avatar Feb 14 '18 10:02 lucienfregosi

@lucienfregosi Hi, we have some basic support in version 0.3, but it is disabled at the moment. I will now be working on full support, including Structured Streaming, as the highest priority. The deadline will be the end of August.

vackosar avatar Jun 22 '18 08:06 vackosar

@lucienfregosi we will not support the old RDD-based streaming at the moment. Would it be an issue for you to switch to Structured Streaming instead, which will be supported? It seems to be treated as the successor of the old streaming API.

vackosar avatar Jun 22 '18 08:06 vackosar

  • A POC version of streaming support was presented at Spark Summit London 2018: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html
  • It has been merged to develop, but it hasn't been officially released yet. The current priority is to change the persistence layer to ArangoDB and reimplement the UI.

vackosar avatar Dec 12 '18 13:12 vackosar

We are withdrawing streaming support from Spline 0.4 as it was not implemented properly. Streaming is not a priority for us at the moment. We'll return to it later.

wajda avatar Jul 30 '19 13:07 wajda

A test case - AbsaOSS/spline#331

wajda avatar Sep 24 '19 11:09 wajda

Hi @wajda, may I confirm that Structured Streaming, e.g. the writeStream API, is not supported? Thanks

NickDudu avatar Apr 16 '22 12:04 NickDudu

No, streaming is not supported due to fundamental problems with the definition and representation of data lineage in the context of streaming. The topic remains unclear.

wajda avatar Apr 17 '22 22:04 wajda

Hi @wajda No problem, thanks for the confirmation.

NickDudu avatar Apr 19 '22 02:04 NickDudu

Hello everyone, we have been investigating Spline and Spark Structured Streaming. We were able to implement a Spline agent for Spark Structured Streaming using Spark's StreamingQueryListener, similar to what is described here: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (9:02 - 11:23). Code for our POC can be found here: https://github.com/jozefbakus/spline-spark-agent/pull/1
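To illustrate the listener-based approach, here is a minimal, hypothetical sketch (not the actual POC code, and not Spline's API): Spark's `StreamingQueryListener` fires a progress event per micro-batch, and an agent can map each event to a lineage record. The `SourceProgress`, `QueryProgressEvent`, and `LineageListener` names below are simplified stand-ins for the corresponding Spark classes.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SourceProgress:          # simplified stand-in for Spark's SourceProgress
    description: str           # e.g. "KafkaV2[Subscribe[topic-a]]"
    start_offset: int
    end_offset: int

@dataclass
class QueryProgressEvent:      # simplified stand-in for Spark's event class
    query_id: str
    batch_id: int
    sources: List[SourceProgress]
    sink_description: str

class LineageListener:
    """Collects one lineage record per micro-batch, mimicking what an
    onQueryProgress callback in a StreamingQueryListener could do."""
    def __init__(self) -> None:
        self.records: List[Dict] = []

    def on_query_progress(self, event: QueryProgressEvent) -> None:
        # Turn the micro-batch progress into a read/write lineage record.
        self.records.append({
            "query": event.query_id,
            "batch": event.batch_id,
            "reads": [(s.description, s.start_offset, s.end_offset)
                      for s in event.sources],
            "writes": event.sink_description,
        })

listener = LineageListener()
listener.on_query_progress(QueryProgressEvent(
    query_id="q-1", batch_id=0,
    sources=[SourceProgress("KafkaV2[Subscribe[topic-a]]", 0, 41)],
    sink_description="FileSink[/data/out]"))
```

In the real agent this callback would run inside the Spark driver and post the record to the Spline gateway; here it just accumulates records so the mapping is visible.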

Along the way we came across one major problem: linking, that is, connecting streaming parent-child lineages. Currently, time-based linking is used: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (18:33 - 20:14). Time-based linking is not sufficient for streaming jobs, so we are trying to find a suitable type of linking for them. One solution might be to use Kafka offsets in a similar way as described here: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (20:14 - 22:15).

To be able to link parent-child lineages, both source and destination offsets (read and write offsets) are required. Spark gives us source offsets out of the box; the problem lies with destination offsets, since Spark does not expose which offsets the data was written to. Getting destination offsets in a clean, pluggable way is the issue we are currently trying to resolve before we can move forward.

Using read/write offsets linking might not be the only way, so we are also investigating different types of lineage linking.

jozefbakus avatar Apr 22 '22 12:04 jozefbakus

The Spark Streaming support has been deprioritized, so I'm removing this feature from the active backlog.

wajda avatar Jan 26 '23 10:01 wajda