spark-cassandra-connector
Implemented a where that has an individual clause for each row.
Added an individual, per-row clause to the join's where. This allows the following kind of syntax:
rdd.joinWithCassandraTable(ks, tableName).where("timestampMilis = ?", (k : KVRow) => Seq(k.timestampSecond * 1000))
Of course the syntax in the where can be improved, which is why it's open to review.
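To make the intent of the per-row clause concrete, here is a minimal plain-Scala sketch with no Spark or Cassandra dependency. The `key` field of `KVRow` and the `bindPerRow` helper are assumptions made purely for illustration; only the clause string and the shape of the function come from the example above. It shows the bind values that the function would produce for each left-side row, not how the connector executes the query.

```scala
object PerRowWhereSketch {
  // Hypothetical left-side row; only timestampSecond appears in the PR example.
  final case class KVRow(key: String, timestampSecond: Long)

  // Hypothetical helper: pairs the shared clause with the values computed for each row.
  def bindPerRow[L](rows: Seq[L], clause: String, values: L => Seq[Any]): Seq[(String, Seq[Any])] =
    rows.map(row => (clause, values(row)))

  def main(args: Array[String]): Unit = {
    val rows = Seq(KVRow("a", 1L), KVRow("b", 2L))
    val bound = bindPerRow(rows, "timestampMilis = ?", (k: KVRow) => Seq(k.timestampSecond * 1000))
    // Print the clause together with the values bound for each left-side row.
    bound.foreach { case (clause, vs) => println(s"$clause <- ${vs.mkString(", ")}") }
  }
}
```

The point of the sketch is that the clause text stays the same for every row while the bound values vary with the left element.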
Hi @crakjie, thanks for your contribution!
In order for us to evaluate and accept your PR, we ask that you sign the Spark Cassandra Connector CLA. It's all electronic and will take just minutes.
Thank you @crakjie for signing the Spark Cassandra Connector CLA.
Can you write a little bit about the use case for this API? I took a brief look today (sorry, I've been so busy) and I think it's a very cool idea, but I'm having a hard time thinking about how someone would actually use it.
Hmm, I think it makes it possible to do a join like the standard SQL query below:
select * from tb1 join tb2 on tb1.id = tb2.id where tb1.eventTime between tb2.from and tb2.to
when I have data as below:
tb1 (RDD)
id|eventTime|others
1|12:30|foo

tb2 (table in Cassandra)
id|from|to|asset name
1|11:10|11:40|Bar
1|11:41|15:30|What Im looking
As far as I know, when I do joinWithCassandraTable today I will receive these two rows, but after this patch I will receive only one. For me it would be a very nice feature, but I don't know if I understood this code correctly. Please correct me if I'm wrong.
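The difference between the two behaviours can be sketched with in-memory collections. This is only a model of the semantics described above, using the tb1 and tb2 rows from the example; the case class names and the minutes-since-midnight encoding are assumptions, and nothing here reflects how the connector actually runs the query.

```scala
object RangeJoinSketch {
  // Times are stored as minutes since midnight to keep comparisons simple (an assumption).
  final case class Event(id: Int, eventTime: Int, others: String)        // tb1 (RDD side)
  final case class Asset(id: Int, from: Int, to: Int, assetName: String) // tb2 (Cassandra side)

  def minutes(h: Int, m: Int): Int = h * 60 + m

  val tb1: Seq[Event] = Seq(Event(1, minutes(12, 30), "foo"))
  val tb2: Seq[Asset] = Seq(
    Asset(1, minutes(11, 10), minutes(11, 40), "Bar"),
    Asset(1, minutes(11, 41), minutes(15, 30), "What Im looking")
  )

  // Join on the key only: every tb2 row with the same id matches.
  def keyJoin: Seq[(Event, Asset)] =
    for (e <- tb1; a <- tb2 if a.id == e.id) yield (e, a)

  // Join on the key plus a range condition whose bounds depend on the left element.
  def rangeJoin: Seq[(Event, Asset)] =
    for (e <- tb1; a <- tb2 if a.id == e.id && a.from <= e.eventTime && e.eventTime <= a.to)
      yield (e, a)

  def main(args: Array[String]): Unit = {
    println(s"key join rows: ${keyJoin.size}")
    println(s"range join rows: ${rangeJoin.size}")
  }
}
```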
I had this idea because I have to do a join on a timestamp, but not with an equality (==) condition on that timestamp.
The database contained timestamps older than those in the left RDD, and each element of the left RDD carried the information about how old the rows it had to be joined with should be. To do that, I needed to use information contained in each left element.
So the general idea was to be able to modify the where clause depending on each input element.
I still don't know if the type of the "fwhere" function is good or if it can be simplified. Currently the function has to return an internal SCC object.
I'm wondering if we might be better off with just another API, like a generic "RunPreparedStatements",
which would be something like:
RDD[BoundParameters].runPreparedStatements[ReturnType]("CQL HERE with ? PARAMETERS ?")
The joins would then become a child class of that?
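As a rough illustration of what such a generic API could look like, here is a plain-Scala sketch. Everything in it is hypothetical: the `Executor` trait stands in for a Cassandra session, a `Seq` stands in for the RDD, and the placeholder count is a naive check that would also count any `?` inside string literals. It is a sketch of the proposed shape, not an existing connector API.

```scala
object RunPreparedStatementsSketch {
  // Hypothetical stand-in for a Cassandra session: runs a CQL string with bound values.
  trait Executor[R] {
    def execute(cql: String, params: Seq[Any]): Seq[R]
  }

  // Hypothetical shape of the proposed API, with a plain Seq in place of an RDD.
  def runPreparedStatements[R](bound: Seq[Seq[Any]], cql: String)(executor: Executor[R]): Seq[R] = {
    // Naive arity check: assumes every '?' in the statement is a bind marker.
    val placeholders = cql.count(_ == '?')
    bound.flatMap { params =>
      require(params.size == placeholders,
        s"expected $placeholders parameters, got ${params.size}")
      executor.execute(cql, params)
    }
  }

  def main(args: Array[String]): Unit = {
    // Echo executor used only for the demo: returns the bound values as one string.
    val echo = new Executor[String] {
      def execute(cql: String, params: Seq[Any]): Seq[String] = Seq(params.mkString(","))
    }
    val cql = "SELECT * FROM ks.tb2 WHERE id = ? AND from <= ?"
    runPreparedStatements[String](Seq(Seq(1, 750), Seq(2, 800)), cql)(echo).foreach(println)
  }
}
```

Under this shape, a join would be one particular statement plus a function that turns each left element into its bound parameters.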
@RussellSpitzer +1 to that idea. This would be a really great addition, giving users very strong flexibility in RDD processing.
Back from holidays. Why not, @RussellSpitzer, but how would the validity of the request be checked?
I think we should pause on this and instead focus on making a completely flexible function, like I described above. That way we don't increase the complexity of the code as it is, and we are able to introduce a greater amount of flexibility.