spark-cassandra-connector Implemented where that have individual clause for each row.

Implemented where that have individual clause for each row.

Open crakjie opened this issue 7 years ago • 9 comments

Added individual clause to the joins where. This allow this kind of syntax :

rdd.joinWithCassandraTable(ks, tableName).where("timestampMilis = ?", (k : KVRow) => Seq(k.timestampSecond * 1000))

Of course the syntax un the where can be enhance, it's why it's open to review.

Nov 28 '16 16:11 crakjie

Hi @crakjie, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign Spark Cassandra Connector CLA. It's all electronic and will take just minutes.

Nov 28 '16 16:11 datastax-bot

Thank you @crakjie for signing the Spark Cassandra Connector CLA.

Dec 06 '16 17:12 datastax-bot

Can you write a little bit of the use case for this api? I took a brief look today (sorry so busy) and I think it's a very cool Idea but i'm having a hard time thinking about how someone would actually use it?

Dec 21 '16 01:12 RussellSpitzer

Hmm I think it can do possibility to join as below standard SQL query select * from tb1 join tb2 on id = id where tb1.eventTime between tb2.from and tb2.to When I have data as below:

tb1 (RDD) id|eventTime|others 1|12:30|foo

tb2 (table in Cassandra) id|from|to|asset name 1|11:10|11:40|Bar 1|11:41|15:30|What Im looking

Now as I know when I do joinwithCassandraTable I will receive this two rows but after this patch I will receive only 1 For me it will be very nice feature but I don't know if I understood this code correctly. Please correct me if I'm wrong.

Dec 22 '16 09:12 LukaszZu

I had this idea because I have to do a join over timestamp but not == timestamp.

The database was contening timestamp older than the left RDD. And each element of the left RDD was containing the information about how old RDD the element have to be joined with. So to do that I had to have an information contained in each left element.

So the general idea was to be able to modify the where close depending on each input elements.

I still don't know if the type of the "fwhere" function is good or if it can be simplified. Actually the function has to return an internal scc object ..

Dec 22 '16 10:12 crakjie

I'm wondering if we might be better off with just another api, like a generic "RunPreparedStatements"

Which would be something like

RDD[BoundParameters].runPreparedStatements[ReturnType]("CQL HERE with ? PARAMETERS ?")

Of which then the Joins become a child class of?

Dec 22 '16 20:12 RussellSpitzer

@RusselSpitzer +1 to that idea. This would be a really great addition, giving users very strong flexibility on RDD processing.

Dec 23 '16 15:12 etspaceman

Back from hollydays. Why not @RussellSpitzer, but how de validity of the request is made?

Jan 17 '17 12:01 crakjie

I think we should pause on this and instead focus on making completely flexible function. Like I described above, that way we don't increase the complexity of the code as is and are able to introduce a greater amount of flexibility.

Jan 17 '17 19:01 RussellSpitzer

spark-cassandra-connector spark-cassandra-connector copied to clipboard

Implemented where that have individual clause for each row.

spark-cassandra-connector
spark-cassandra-connector copied to clipboard