Basic Arrow.jl-based collect and createDataFrame
Functions collect_arrow, collect_tuples, and collect_df are provided,
all of which use Arrow.jl and Spark's Arrow support to transfer data
from Spark to Julia. collect_arrow returns the raw Arrow.jl table,
collect_df returns a DataFrames.jl DataFrame, and collect_tuples
returns a plain Vector of named tuples.
createDataFrame now has overloads which accept a DataFrame or an abstract table.
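For illustration, here's roughly how I expect this to be used. The exact signatures are my assumption (in particular, that each collect function takes a Spark DataFrame as its only argument and that createDataFrame takes the session first), so treat this as a sketch rather than the final API:

```julia
# Assumes an existing SparkSession `spark` and a Spark DataFrame `sdf`
# obtained from it (session setup omitted).
using Arrow, DataFrames

tbl  = collect_arrow(sdf)    # raw Arrow.jl table
df   = collect_df(sdf)       # DataFrames.jl DataFrame
rows = collect_tuples(sdf)   # Vector of named tuples

# And back the other way: createDataFrame accepts a DataFrame
# (or any Tables.jl-compatible table). Argument order is assumed here.
sdf2 = createDataFrame(spark, df)
```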
This version creates a temporary file for each transfer, but I actually think that's preferable in many ways to socket-based transfer:
- Simpler :)
- Arrow.jl will mmap the file, so in theory it can handle slightly-larger-than-RAM datasets (see the sketch below)
- or, if you have /tmp in tmpfs (a RAM disk), it will just mmap that chunk of memory, without additional copying on the Julia side
However, if you think sockets would be preferable, I can change it (PySpark and SparkR use sockets).
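To make the mmap point concrete, this is the Arrow.jl behavior the file-based transfer relies on (a self-contained sketch, independent of the PR's code):

```julia
using Arrow

# Write a table to a temporary file, then read it back.
# Arrow.Table memory-maps the file instead of copying it into RAM,
# which is what makes the file-based transfer cheap.
path = tempname()
Arrow.write(path, (a = 1:1_000_000, b = rand(1_000_000)))

tbl = Arrow.Table(path)   # mmapped; columns are lazy views into the file
sum(tbl.a)
```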
A few things are still missing:
- I included two versions of both collectToArrow and fromArrow, since I couldn't decide yet which is better.
  - One collectToArrow is basically the same as the one I used last year. The other comes from looking at how PySpark currently does it.
  - It would be nice to benchmark them; I don't currently have meaningful data for this, as I left the company where we used Spark.
- I added DataFrames.jl to the dependencies, but it doesn't seem right to depend on this fairly non-trivial library just for collect_df. Does Julia support something like optional dependencies? It would seem nicer to import DataFrames only if it's already installed, and otherwise error out in collect_df.
Thanks a lot! It's great to see Arrow getting back to Spark.jl!
> This version creates a temporary file for each transfer, but I actually think that's preferable in many ways to socket-based transfer
I'm totally fine with files. If I remember correctly, PySpark uses (used?) both sockets and files, in different places or with different settings. Anyway, I realized that it's not necessarily a good idea to follow the PySpark or SparkR design, since many decisions in them were made specifically for those languages or under conditions that may not apply to our case. So let's do whatever is best for Julia.
> It would be nice to benchmark them; I don't currently have meaningful data for this, as I left the company where we used Spark.
Can't we just generate, say, a big Parquet file?
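Something like the following could work for synthesizing test data; it assumes Parquet.jl's write_parquet handles these column types, and the sizes, path, and column mix are arbitrary placeholders:

```julia
using Parquet

# Build a moderately large synthetic table and write it to Parquet,
# so Spark can read it back for the collect benchmarks.
n = 10_000_000
tbl = (id = collect(1:n), x = rand(n), label = rand(["a", "b", "c"], n))
write_parquet("/tmp/bench.parquet", tbl)
```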
> Does Julia support something like optional dependencies? It would seem nicer to import DataFrames only if it's already installed, and otherwise error out in collect_df.
Requires.jl was designed to do exactly this, though I haven't used it in years and don't know its current status.
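If it still works, the pattern would look roughly like this. The file name is hypothetical, and the UUID should be double-checked against DataFrames.jl's Project.toml:

```julia
module Spark

using Requires

# Stub; a method is added only if/when the user loads DataFrames.
function collect_df end

function __init__()
    # @require triggers the include when DataFrames is imported.
    # UUID must match DataFrames.jl's registered UUID (verify before use).
    @require DataFrames="a93c6f00-e57d-5684-b7b6-d8193f3e46c0" begin
        include("dataframes_support.jl")  # hypothetical file defining collect_df
    end
end

end
```

With this, Spark.jl itself wouldn't need DataFrames.jl in its dependencies, and collect_df would throw a MethodError (or a friendlier custom error from the stub) unless DataFrames is installed and loaded.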