Spark-Structured-Streaming-Examples

Spark 2.3.1

Open polomarcus opened this issue 6 years ago • 11 comments

Spark 2.2.0 to 2.3.1

Need to update Cassandra Sink

polomarcus avatar Jun 19 '18 03:06 polomarcus

I don't think it will work. I tried upgrading Spark to 2.3.1 and it started returning this error: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;

By the way, there is already a sink for Cassandra in DSE 6. I wonder when (or if) they will port that solution to the Cassandra driver. The sink I mentioned lives in spark-connector-6.0.2.jar, in CassandraSourceRelation. I tested it in stand-alone DSE and it works with .outputMode("update"). It is a pity that the community cannot use that solution for free...

cranberrysoft avatar Aug 03 '18 06:08 cranberrysoft

Hey Mariusz,

I know, the custom Cassandra sink that uses the DataStax connector does not work with Spark 2.3. I need to spend some time fixing that.

I'm aware of the DSE 6 sink. I talked about it with Russel, and there are no plans to release it to the community.

Paul

polomarcus avatar Aug 03 '18 07:08 polomarcus

Hi Paul, I'd love to help you develop a sink for Cassandra, especially since, as you said, it is not going to be included in the open-source driver. Please let me know how I can reach you if you need any help with this.

cranberrysoft avatar Aug 03 '18 10:08 cranberrysoft

I would have a look at the Elastic sink, which is open source, and use their implementation as inspiration. Hopefully we just need to change an import (DatasourceV2 or something like that), but we may also have to rewrite the sink to be 2.3 compliant, which could take some time :/

We also have the foreach sink, which can be used with Cassandra. I refer to it as "unsafe" in the repo.
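For readers who have not used it, here is a minimal sketch of what a foreach sink writing to Cassandra could look like. It is not the code from the repo. The `Event` case class, the `demo.events` table and the `rate` test source are hypothetical stand-ins, and the sketch assumes a 2.x release of the DataStax Spark Cassandra Connector (which exposes the Java driver 3.x `Session`) on the classpath.

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.{ForeachWriter, SparkSession}

// Hypothetical record type; the demo.events table below is also made up.
case class Event(id: String, value: Long)

// Writes one row at a time through the connector's session handling.
class CassandraForeachWriter(connector: CassandraConnector)
    extends ForeachWriter[Event] {

  // Called once per partition and trigger; returning true means "process the rows".
  override def open(partitionId: Long, version: Long): Boolean = true

  // Called for every row: one INSERT statement per record.
  override def process(event: Event): Unit =
    connector.withSessionDo { session =>
      session.execute(
        "INSERT INTO demo.events (id, value) VALUES (?, ?)",
        event.id, Long.box(event.value))
    }

  // Nothing to release here: the connector manages the underlying session.
  override def close(errorOrNull: Throwable): Unit = ()
}

object ForeachSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("foreach-cassandra-sketch")
      .getOrCreate()
    import spark.implicits._

    // The built-in "rate" source stands in for a real stream (Kafka, files, ...).
    val events = spark.readStream
      .format("rate")
      .load()
      .select($"value".cast("string").as("id"), $"value")
      .as[Event]

    // Reads spark.cassandra.connection.host and related settings from the Spark conf.
    val connector = CassandraConnector(spark.sparkContext.getConf)

    events.writeStream
      .foreach(new CassandraForeachWriter(connector))
      .start()
      .awaitTermination()
  }
}
```

Because `process` issues one statement per row, this path bypasses the connector's bulk write logic and does not guard against duplicate writes when a batch is replayed, which is likely part of why it earns the "unsafe" label.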

polomarcus avatar Aug 03 '18 21:08 polomarcus

I also thought about the foreach sink, but it has two downsides. ~~First of all, it does not support the stateful transformations that are the key feature of Structured Streaming.~~ Secondly, I believe this solution is not really optimal, since it uses a low-level API to save data to Cassandra and operates on one row at a time, so all the under-the-hood optimizations done by the driver are probably lost. I am pretty sure you have seen one of Russel's videos about the Cassandra driver: https://www.youtube.com/watch?v=cKIHRD6kUOc

I also tried to find inspiration in the DSE implementation. Unfortunately, it is not open source and it is Scala code, so you cannot easily decompile it ;) but I will try to dig a little to understand how it should be implemented.

cranberrysoft avatar Aug 04 '18 00:08 cranberrysoft

Hi guys, I'm also interested in this and I'd love to help you with development. Please let me know how I can contact you for this effort. Cheers

redsk avatar Aug 23 '18 10:08 redsk

See also: https://github.com/scylladb/scylla-code-samples/issues/67#issuecomment-416545149

snowch avatar Aug 28 '18 15:08 snowch

Thanks for all your messages :smile:

If you feel like giving it a try, the official Elastic sink can be a great source of inspiration for the Cassandra sink:

  • https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20/src/main/scala/org/elasticsearch/spark/sql/streaming

Compare it with what we have in the repo:

  • https://github.com/polomarcus/Spark-Structured-Streaming-Examples/tree/master/src/main/scala/cassandra/StreamSinkProvider
  • https://github.com/polomarcus/Spark-Structured-Streaming-Examples/blob/master/src/main/scala/cassandra/CassandraDriver.scala#L55-L63

I might be able to spend some time on the issue next month.

polomarcus avatar Aug 28 '18 20:08 polomarcus

Looks like there is some useful stuff in here: https://github.com/scylladb/scylla-code-samples/pull/68

snowch avatar Sep 20 '18 12:09 snowch

Thanks @snowch. Scylla does it the same way, using the DataStax connector: https://github.com/scylladb/scylla-code-samples/pull/68/files#diff-1e869081fec2d3c842a3b91688825a5eR71

I'm guessing it should take only a small fix to get the project and the Cassandra sink running on Spark 2.3.1.

polomarcus avatar Sep 20 '18 12:09 polomarcus

@polomarcus are you planning to implement the fix you suggested above?

snowch avatar Oct 26 '18 13:10 snowch