
Spark Streaming Support

Open lucienfregosi opened this issue 7 years ago • 12 comments

Is it planned to add integration with Spark Streaming? It would be useful to be able to apply lineage tracking to both batch and streaming data.

lucienfregosi avatar Feb 14 '18 10:02 lucienfregosi

Hi Lucien, we are currently enhancing Spline to also support Structured Streaming. This feature will come with Spline version 0.3.

Regards, Marek Novotny


mn-mikke avatar Feb 14 '18 10:02 mn-mikke

Perfect :)

I'm writing a blog post about Spline after testing it (in French first, maybe in English later), so I will be able to include this information in my post.

lucienfregosi avatar Feb 14 '18 10:02 lucienfregosi

@lucienfregosi Hi, we have some basic support in version 0.3, but it is disabled at the moment. I will now be working on full support, including Structured Streaming, as the highest priority. The deadline will be the end of August.

vackosar avatar Jun 22 '18 08:06 vackosar

@lucienfregosi we will not support the old RDD-based streaming at the moment. Would it be an issue for you to switch to Structured Streaming instead, which will be supported? It seems to be treated as the successor of the old streaming API.

vackosar avatar Jun 22 '18 08:06 vackosar

  • A POC version of streaming support was presented at Spark Summit London 2018: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html
  • It has been merged to develop, but it hasn't been officially released yet. The current priority is to change the persistence layer to ArangoDB and reimplement the UI.

vackosar avatar Dec 12 '18 13:12 vackosar

We are withdrawing streaming support from Spline 0.4 as it was not implemented properly. Streaming is not a priority for us at the moment. We'll return to it later.

wajda avatar Jul 30 '19 13:07 wajda

A test case - AbsaOSS/spline#331

wajda avatar Sep 24 '19 11:09 wajda

Hi @wajda, may I confirm that Structured Streaming, e.g. the writeStream API, is not supported? Thanks

NickDudu avatar Apr 16 '22 12:04 NickDudu

No, streaming is not supported due to fundamental problems with the definition and representation of data lineage in the context of streaming. The topic remains unclear.

wajda avatar Apr 17 '22 22:04 wajda

Hi @wajda No problem, thanks for the confirmation.

NickDudu avatar Apr 19 '22 02:04 NickDudu

Hello everyone, we have been investigating Spline and Spark Structured Streaming. We were able to implement a Spline agent for Spark Structured Streaming using Spark's StreamingQueryListener, similar to what is described here: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (9:02 - 11:23). Code for our POC can be found here: https://github.com/jozefbakus/spline-spark-agent/pull/1
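To illustrate the listener-based approach, here is a minimal, hypothetical sketch (not the actual POC code, and not Spline's API): Spark's `StreamingQueryListener` fires a progress event per micro-batch, and an agent can map each event to a lineage record. The `SourceProgress`, `QueryProgressEvent`, and `LineageListener` names below are simplified stand-ins for the corresponding Spark classes.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SourceProgress:          # simplified stand-in for Spark's SourceProgress
    description: str           # e.g. "KafkaV2[Subscribe[topic-a]]"
    start_offset: int
    end_offset: int

@dataclass
class QueryProgressEvent:      # simplified stand-in for Spark's event class
    query_id: str
    batch_id: int
    sources: List[SourceProgress]
    sink_description: str

class LineageListener:
    """Collects one lineage record per micro-batch, mimicking what an
    onQueryProgress callback in a StreamingQueryListener could do."""
    def __init__(self) -> None:
        self.records: List[Dict] = []

    def on_query_progress(self, event: QueryProgressEvent) -> None:
        # Turn the micro-batch progress into a read/write lineage record.
        self.records.append({
            "query": event.query_id,
            "batch": event.batch_id,
            "reads": [(s.description, s.start_offset, s.end_offset)
                      for s in event.sources],
            "writes": event.sink_description,
        })

listener = LineageListener()
listener.on_query_progress(QueryProgressEvent(
    query_id="q-1", batch_id=0,
    sources=[SourceProgress("KafkaV2[Subscribe[topic-a]]", 0, 41)],
    sink_description="FileSink[/data/out]"))
```

In the real agent this callback would run inside the Spark driver and post the record to the Spline gateway; here it just accumulates records so the mapping is visible.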

Along the way we came across one major problem: linking, that is, connecting streaming parent-child lineages. Currently, time-based linking is used: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (18:33 - 20:14). Time-based linking is not sufficient for streaming jobs, so we are trying to find a suitable type of linking for them. One solution might be to use Kafka offsets in a similar way as described here: https://absaoss.github.io/spline/2018/10/04/Spline-Data-Lineage-For-Structured-Streaming.html (20:14 - 22:15).

To be able to link parent-child lineages, both source and destination offsets (read and write offsets) are required. Spark gives us source offsets out of the box; the problem lies with destination offsets, since Spark does not expose which offsets the data was written to. Getting destination offsets in a clean, pluggable way is the issue we are currently trying to resolve before we can move forward.

Using read/write offsets linking might not be the only way, so we are also investigating different types of lineage linking.

jozefbakus avatar Apr 22 '22 12:04 jozefbakus

The Spark Streaming support has been deprioritized, so I'm removing this feature from the active backlog.

wajda avatar Jan 26 '23 10:01 wajda