
💡 Feature Request: Add `to_spark()` to the dataset class

Open jettdc opened this issue 10 months ago • 5 comments

There's an interesting use case here where you could use PyAirbyte directly in data pipelines that run on Spark.

Currently, if you want to do this, you need to call to_pandas() and then spark_session.createDataFrame(issues_df, schema=my_schema). This seems inefficient, and you have to define the schema manually to handle idiosyncrasies like JSON blobs (which are object in pandas but need to be StringType in Spark) and integer widths (pandas uses 64-bit ints while Spark distinguishes Int and Long).
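
For illustration, here's a rough sketch of that workaround with today's PyAirbyte API; the connector config, stream name, and schema fields are placeholders, not a definitive recipe:

```python
import airbyte as ab
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Read the stream into pandas first (the only option today).
source = ab.get_source(
    "source-github",
    config={"credentials": {"personal_access_token": "..."}},  # placeholder config
)
source.select_streams(["issues"])
result = source.read()
issues_df = result["issues"].to_pandas()

# The Spark schema has to be declared by hand, since pandas dtypes don't map
# cleanly: JSON blobs arrive as `object`, ints as 64-bit, etc.
my_schema = StructType([
    StructField("id", LongType()),
    StructField("title", StringType()),
    StructField("labels", StringType()),  # JSON blob kept as a string
])

spark_df = spark.createDataFrame(issues_df, schema=my_schema)
```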

Or maybe a Spark DataFrame cache would be more efficient here?

jettdc avatar Apr 05 '24 14:04 jettdc

Hi, @jettdc! I really like this idea - thanks for suggesting! 💡

The only consideration/tradeoff I can see is that the PySpark dependency may be a heavy addition for those not needing or using Spark. That said, I haven't checked its size/footprint recently, and we could also deliver it as an add-on (a Python "extra"), or simply not bundle it and fail calls to to_spark_df() if PySpark is not installed.

What do you think of these options?

aaronsteers avatar Apr 05 '24 15:04 aaronsteers

To your other suggestion of a SparkCache implementation, I'm not against it, but I can see it being a significant development effort that could be hard for us to prioritize in the near term.

Now that PyIceberg supports writing datasets, there could be a path where we (long-term) support an IcebergCache option, and that would be natively compatible with Spark and a lot of other big data platforms at the same time.
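
To illustrate why that path is appealing, here is a minimal PyIceberg write sketch (the catalog name, namespace, and table identifier are assumptions, and it presumes a recent PyIceberg release with Arrow write support); Spark could then read the same table through its Iceberg catalog integration:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Assumed: a catalog named "default" is configured (e.g. in .pyiceberg.yaml)
# and the "airbyte" namespace already exists.
catalog = load_catalog("default")

# In a hypothetical IcebergCache, each stream would map to a table like this.
records = pa.table({"id": [1, 2], "title": ["a", "b"]})
table = catalog.create_table("airbyte.issues", schema=records.schema)
table.append(records)
```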

aaronsteers avatar Apr 05 '24 16:04 aaronsteers

or just not bundle it and fail calls to to_spark_df() if PySpark is not installed

IMO this seems like a good solution, and variants of this pattern are already used in many places (including PyIceberg and pandas).
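
A minimal sketch of that pattern, purely illustrative rather than the actual PyAirbyte implementation:

```python
def to_spark_df(self, spark_session):
    """Convert this dataset to a Spark DataFrame, if PySpark is available."""
    try:
        import pyspark  # noqa: F401  # optional dependency, not bundled by default
    except ImportError as exc:
        raise ImportError(
            "PySpark is required for to_spark_df(); "
            "install it with `pip install pyspark` (or a PyAirbyte extra)."
        ) from exc

    # Fall back to pandas for the conversion; Spark can infer a schema,
    # though an explicit schema is safer for JSON/object columns.
    return spark_session.createDataFrame(self.to_pandas())
```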

jettdc avatar Apr 05 '24 16:04 jettdc

For my use case, I can't speak to the value of an IcebergCache since we don't use Iceberg in our data lakes. But an interesting aspect of caching at the data processing layer (Spark) rather than the data storage layer (Iceberg) is that, in theory, you automatically get compatibility with everything Spark can write to.

Anyway, to_spark_df() is sufficient for me; just some food for thought :)

jettdc avatar Apr 05 '24 16:04 jettdc

@jettdc - Thanks for this feedback! I've updated the title of this issue slightly. I think to_spark() would be a great addition, so I'm adding the labels good first issue and accepting pull requests.

The Spark cache implementation is a bit nebulous as of now, but the to_spark() approach seems like it could be a big win for data engineers. (We can optionally open another issue for a Spark / data lake cache...)
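
For anyone picking this up, a hypothetical sketch of what the user-facing call might look like; the method name, signature, and behavior are not finalized and this is not the current PyAirbyte API:

```python
import airbyte as ab
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source = ab.get_source(
    "source-github",
    config={"credentials": {"personal_access_token": "..."}},  # placeholder config
)
source.select_streams(["issues"])
result = source.read()

# Hypothetical: hand the dataset straight to Spark, no manual schema needed.
issues_spark_df = result["issues"].to_spark(spark)
issues_spark_df.show()
```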

aaronsteers avatar Apr 16 '24 04:04 aaronsteers