incubator-retired-gearpump
Cassandra integration
Cassandra database integration
- [X] CassandraSource
- [X] CassandraSink
- [X] CassandraStore
This reuses some files from the Spark-Cassandra connector and follows its approach. The intent is to allow the connector itself to be reused once a version that supports other processing systems is available. The Source looks up the token ranges of the target table, splits them into independent sets of partitions, and assigns those sets to the available source tasks, so reads parallelise well. All data fetches except the first are asynchronous. The Sink can be trivially parallelised by the user by assigning different writes to different tasks.
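To illustrate the split-and-assign idea, here is a minimal sketch. It is not the code in this PR or in the Spark-Cassandra connector: `TokenRange`, `TokenRangeSplitter`, `split` and `assign` are hypothetical names. It assumes the Murmur3 partitioner's token space and divides it evenly, rather than looking up the real token ranges from cluster metadata as the Source does. Ring wrap-around is ignored.

```scala
// Hypothetical types for illustration only; not the connector's classes.
final case class TokenRange(start: BigInt, end: BigInt)

object TokenRangeSplitter {
  // Assumption: Murmur3Partitioner, whose tokens lie in [-2^63, 2^63 - 1].
  val MinToken: BigInt = BigInt(Long.MinValue)
  val MaxToken: BigInt = BigInt(Long.MaxValue)

  /** Split the whole token space into `splits` contiguous, non-overlapping ranges. */
  def split(splits: Int): Vector[TokenRange] = {
    require(splits > 0, "splits must be positive")
    val total = MaxToken - MinToken
    // splits + 1 boundaries, evenly spaced from MinToken to MaxToken.
    val bounds = (0 to splits).map(i => MinToken + total * i / splits)
    bounds.sliding(2).map { case Seq(s, e) => TokenRange(s, e) }.toVector
  }

  /** Assign ranges to task indices 0 until `tasks`, round-robin. */
  def assign(ranges: Seq[TokenRange], tasks: Int): Map[Int, Seq[TokenRange]] = {
    require(tasks > 0, "tasks must be positive")
    ranges.zipWithIndex
      .groupBy { case (_, i) => i % tasks }
      .map { case (task, rs) => task -> rs.map(_._1) }
  }
}
```

Each task would then issue its own range queries, for example of the form `WHERE token(pk) > ? AND token(pk) <= ?`, independently of the others. Producing more ranges than tasks keeps the per-task load roughly even when the data is skewed across the ring.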
The Source scans a snapshot of the table as it currently stands and does not yet honour later updates, so it is not a continuous stream. The Source is also not time-replayable. There are options for handling both limitations, but they must be properly thought through. Test coverage is poor at the moment, but this first attempt allows iteration: continuous improvement of the code and the addition of features.
@zapletal-martin Thanks for your contribution. I'll pull your branch and try playing with it.