Add support for Data Frames

Open chrisbetz opened this issue 10 years ago • 25 comments

See https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

chrisbetz avatar Feb 19 '15 10:02 chrisbetz

@chrisbetz I've been looking into this and have a local branch wrapping the DataFrame API using flambo. One issue is that the Spark SQL API in 1.2 and 1.3 is pretty different (e.g. no more JavaSQLContext or JavaSchemaRDD).

How do you plan to version Sparkling across Spark versions? Would you rather try to support both in a release or keep things separate? Put up a minimal example of what it took to get my flambo tests green in 1.3 here: https://github.com/yieldbot/flambo/pull/48/

Might be hard to run things in parallel, but once there's essential 1.3 compat I don't imagine it'd be too hard to build out some functions to work with DataFrames. I personally would love to have a Clojure API for my applications soon; let me know if there's a way I can contribute.

chetmancini avatar Apr 02 '15 21:04 chetmancini

Hi,

thanks for offering to contribute! That’s great. I’ve not looked into the DataFrame API any further (just checked out the announcement document). But it looks promising and I really would like to support it.

Concerning the versioning - I’d like to think about that over the Easter holiday. Currently, I see two options:

a) Having different namespaces in the same project b) branching off sparkling-1.2.0-X.Y.Z and sparkling-1.3.0-X.Y.Z.

If you see any other good options, just tell me.

I’ll come back to you regarding this.

Sincerely,

Chris

chrisbetz avatar Apr 02 '15 21:04 chrisbetz

@chrisbetz Great. Both those sound like viable options; once you pick a route I'll see where we could take support for this. Have a great Easter.

chetmancini avatar Apr 02 '15 21:04 chetmancini

Hi guys! Any update on providing support for Data Frames?

erasmas avatar Apr 20 '15 08:04 erasmas

Hi, sorry, no support for that yet. We need to support at least Spark 1.1 from CDH, Spark 1.2.x, and Spark 1.3, and I need to find a way to support all of them. Currently I'm working on serialization tasks and thus a little busy. Data Frame support will definitely be the next thing to add, so stay tuned. Sorry, but coming up with an approach requires some research and testing.

chrisbetz avatar Apr 20 '15 08:04 chrisbetz

@erasmas I'm currently working on getting DataFrame support into Flambo, since that's what I'm using in prod (I'm looking at switching to Sparkling once I get some time to compare). The code's getting there, but I've been having some issues getting Spark 1.3 to run on the cluster for final testing.

chetmancini avatar Apr 27 '15 16:04 chetmancini

Hi @chrisbetz @chetmancini, any updates on Data Frames support?

prateekbhatt avatar Dec 21 '15 18:12 prateekbhatt

I may have some time to wrap some of the code that I've written, but I've only ever used Spark 1.5.x.

@chrisbetz let me know how you'd like to proceed.

retnuh avatar Jan 05 '16 09:01 retnuh

Out of interest what form would a DataFrames wrapper take? For the reading & queries side of things would it be some declarative DSL similar to Datomic Datalog for example?

alza-bitz avatar Mar 01 '16 00:03 alza-bitz

I doubt it would look like Datalog. Considering that Sparkling's RDD wrappers stick really close to the native interface, I'd say DataFrames would be similar. The DSL would probably be sorta SQL like where you have select statements with columns & expressions.

Going too far beyond that would probably impose quite an impedance mismatch...

retnuh avatar Mar 01 '16 08:03 retnuh

I'd really like to help. I started putting something together the other day: https://github.com/nabacg/sparkling/commit/ae935a551a0b9946afc20cbe69b149893c8cda36. It's very basic, and I'm not sure how far you guys got. Maybe we could join our efforts, @retnuh?

nabacg avatar Apr 28 '16 08:04 nabacg

I would like to help but I've not had much time to work on this lately - nor will I in the near future.

What I have is mostly just code that uses DataFrames; I hadn't really gotten to the point of abstracting out the useful stuff (like a select function that examines its args and "does the right thing" with wrapping the args in an Array, if necessary, etc.)

retnuh avatar Apr 28 '16 08:04 retnuh

Hi, I created a PR at https://github.com/gorillalabs/sparkling/pull/49. Most of the code is working in my daily jobs. I have written some tests, but not for everything yet. I will submit more tests for the SQL and DataFrame functions.

MarchLiu avatar Jul 07 '16 17:07 MarchLiu

I have the following functionality that I could add:

  • Parquet Support
  • RDD <-> DataFrames
  • A handful of other SQL related functions that I needed for my project

One of the bigger outstanding problems that I see is how DataFrame joins work. The Java syntax needs a good macro wrapper, but I haven't had time to finish my attempt.

I don't want to step on @MarchLiu's efforts, so I'll wait until his changes are sorted out before I throw any of this into the mix. It looks solid. I like how you made thread-ability a key part of your implementation. There were a couple of spots where I should have done that but didn't.

NeilMenne avatar Jul 08 '16 15:07 NeilMenne

@NeilMenne would very much be interested in Parquet Support, if possible.

MafcoCinco avatar Mar 02 '17 04:03 MafcoCinco

To be clear, you can work with DataFrames and use parquet files in the existing version, it's just annoying. You have to use the Java API more or less directly, and it suffers from some warts between Java <-> Scala interop, particularly in the area of varargs.

I used it successfully but there was plenty of ugly code with creating and filling type specific arrays and weird calls where you have one string and then an array of strings, etc.
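A minimal sketch of the varargs wart described above, using only the JDK so it runs without Spark. `java.nio.file.Paths/get` is declared as `get(String first, String... more)`, which is the same "one string and then an array of strings" shape as the Java-facing `select(String, String...)` on a DataFrame. The helper names `split-varargs` and `path-of` are hypothetical and not part of Sparkling.

```clojure
(import '(java.nio.file Paths))

;; Clojure has no varargs sugar for Java methods, so the trailing
;; arguments must be packed into a typed String[] by hand.
(def raw-path
  (Paths/get "data" (into-array String ["events" "part-0.parquet"])))

;; Hypothetical helper: split a seq of names into the leading argument
;; and a String[] holding the varargs tail (empty array if no tail).
(defn split-varargs
  [names]
  [(first names) (into-array String (rest names))])

(defn path-of
  "Variadic Clojure wrapper around a (String, String...) Java method."
  [& segments]
  (let [[head tail] (split-varargs segments)]
    (Paths/get head tail)))

;; Assumed Spark analogue for a DataFrame `df` (not run here):
;; (let [[c cs] (split-varargs ["name" "age"])]
;;   (.select df c cs))
```

With a wrapper like this, `(path-of "data" "events" "part-0.parquet")` builds the same path as the raw interop call, and the array handling stays in one place.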

It is currently do-able, just ugly.

I talk a bit about it at the talk I gave at ClojureConj in 2015: https://youtu.be/ARBiyYyW4Ow?t=689

Slides: https://www.slideshare.net/ZalandoTech/spark-clojure-for-topic-discovery-zalando-tech-clojureconj-talk starting around slide 20-21

H

EDIT: I posted this before I saw the 2.0 sparkling release, obviously!

retnuh avatar Mar 02 '17 10:03 retnuh

I no longer have access to the code I wrote at OpenTable. If there's a need for it, I could probably do a clean room implementation. I still use Spark at my current position, so it's fresh in my mind.

NeilMenne avatar Mar 02 '17 15:03 NeilMenne

@NeilMenne That would be great, especially in the area of more idiomatic support for Parquet and RDD <-> DataFrames. If it is a ton of work, don't worry about it but would definitely be useful if you had the time.

MafcoCinco avatar Mar 02 '17 15:03 MafcoCinco

I'll have to get back up to speed on sparkling, but I'll see what I can do.

NeilMenne avatar Mar 02 '17 15:03 NeilMenne

Awesome! Thanks so much.

MafcoCinco avatar Mar 02 '17 15:03 MafcoCinco

My team has a hack project coming up and we were planning on using Sparkling as part of the implementation. I'm going to take a crack at building an API for data frames. If successful, I'll submit it as a PR. Just for background, it seems like there is some support already using a combination of the Java API + the new SQL API that was added in 2.x. Are there any examples of using the new SQL API and/or native (to Sparkling) data frame support? I just want to get a good picture of where I'm starting from, in hopes I can avoid duplicating effort.

MafcoCinco avatar Apr 06 '17 19:04 MafcoCinco

Hi,

Unfortunately, I do not have working examples for this. Maybe somebody out there does?

Please share your question on the Sparkling Google group. If you ask on Twitter, I could retweet from gorillalabs to reach more people.

Happy hacking!

Chris

chrisbetz avatar Apr 07 '17 05:04 chrisbetz

I submitted a PR for adding support for SparkSession API. I think this will address most of what I personally need w.r.t. DataFrame and Parquet support, but I'm sure the implementation can be improved and made more complete.

MafcoCinco avatar Apr 19 '17 19:04 MafcoCinco

I realise that I may be flogging a dead horse, and that another PR was merged in instead of @MafcoCinco's. However, there were some really nice utility functions which @MafcoCinco had written which I would have loved, specifically the dataframe->rdd-of- functions. @chrisbetz, would you be open to negotiation on bringing in some of these functions, or has that ship sailed? Should I rather be building these functions as a utility library for my projects?

Again, forgive me if this is out of line, I just think they're incredibly useful utilities and something I've found myself reaching for recently.

xsyn avatar Aug 23 '17 18:08 xsyn

Hi,

thanks for your input, and yes, I'm open to these additions. If you like, just create a PR with the things you'd like to see and I will look into it after my vacation.

Cheers,

Chris

chrisbetz avatar Aug 24 '17 14:08 chrisbetz