Cassandra integration

Open zapletal-martin opened this issue 8 years ago • 1 comments

Cassandra database integration

  • [X] CassandraSource
  • [X] CassandraSink
  • [X] CassandraStore

This reuses some files from the Spark-Cassandra connector and follows its approach. The intent is to allow the connector itself to be reused once a version for other processing systems is available. The Source looks up the token ranges of the desired table, splits them into independent sets of partitions, and assigns those sets to the available source tasks, which allows a high degree of parallelism. All data fetches except the first are asynchronous. The Sink can be trivially parallelised by the user by assigning different writes to different tasks.
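The split-and-assign step described above can be illustrated with a minimal sketch. It is not the code in this branch: the names `split_ring` and `assign_ranges` are hypothetical, the ring is assumed to be a signed 64-bit token space, and the ranges are cut evenly instead of being looked up from the cluster. It is written in Python only to keep it short.

```python
from typing import List, Tuple

# Assumption: a signed 64-bit token ring. The real Source would look the
# ranges up from the cluster instead of computing them.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

# A range is read as (start, end]: start exclusive, end inclusive.
TokenRange = Tuple[int, int]


def split_ring(splits: int) -> List[TokenRange]:
    """Cut the whole ring into `splits` contiguous, non-overlapping ranges."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + total * i // splits for i in range(splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def assign_ranges(ranges: List[TokenRange], task_count: int) -> List[List[TokenRange]]:
    """Distribute ranges over tasks round-robin so each task scans its own set."""
    if task_count < 1:
        raise ValueError("task_count must be at least 1")
    assignments: List[List[TokenRange]] = [[] for _ in range(task_count)]
    for index, token_range in enumerate(ranges):
        assignments[index % task_count].append(token_range)
    return assignments


ranges = split_ring(8)
per_task = assign_ranges(ranges, 3)
```

Because the sets are disjoint, each source task can read its ranges without coordinating with the others. That independence is what the parallelism in the description relies on.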

The Source scans a snapshot of the current table and does not currently honour updates, so it is not a continuous stream. The Source is also not time-replayable. There are options for handling both limitations, but they must be properly thought through. Test coverage is poor at the moment, but this first attempt will allow iteration, continuous improvement of the code, and the addition of features.

zapletal-martin avatar Jul 24 '16 15:07 zapletal-martin

@zapletal-martin Thanks for your contribution. I'll pull your branch and try playing with it.

manuzhang avatar Jul 25 '16 01:07 manuzhang