PyAirbyte
💡 Feature Request: Add `to_spark()` to the dataset class
There's an interesting use case here where you could use PyAirbyte directly in data pipelines that run on Spark. Currently, if you want to do this, you need to call `to_pandas()` and then `spark_session.createDataFrame(issues_df, schema=my_schema)`, but this seems inefficient, and you also have to manually define the schema to account for idiosyncrasies like JSON blobs (which are `object` in pandas but need to be `StringType` in Spark) and integer widths (pandas uses 64-bit ints, while Spark distinguishes `IntegerType` and `LongType`).
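For context, the current workaround looks roughly like this (the source config, stream name, and column names are illustrative, not a definitive recipe):

```python
import json

import airbyte as ab
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Read a stream with PyAirbyte (config keys shown are illustrative).
source = ab.get_source(
    "source-github",
    config={"repositories": ["airbytehq/PyAirbyte"], "credentials": {"personal_access_token": "..."}},
)
source.select_streams(["issues"])
result = source.read()

issues_df = result["issues"].to_pandas()

# JSON blob columns come through as `object` in pandas; Spark needs them as
# strings, so serialize them explicitly before handing off to Spark.
issues_df["labels"] = issues_df["labels"].apply(
    lambda v: v if isinstance(v, str) else json.dumps(v)
)

# The schema has to be spelled out by hand: pandas int64 -> Spark LongType,
# JSON blobs -> StringType. (Columns shown here are illustrative.)
issues_schema = StructType(
    [
        StructField("id", LongType(), nullable=True),
        StructField("title", StringType(), nullable=True),
        StructField("labels", StringType(), nullable=True),
    ]
)

spark = SparkSession.builder.getOrCreate()
spark_issues_df = spark.createDataFrame(
    issues_df[["id", "title", "labels"]], schema=issues_schema
)
```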
Or maybe a Spark DataFrame cache would be more efficient here?
Hi, @jettdc! I really like this idea - thanks for suggesting! 💡
The only consideration/tradeoff I can see is that the PySpark dependency may be a heavy addition for those not needing or using Spark. That said, I haven't checked its size/footprint recently, and we could also deliver it as an add-on (Python "extra"), or just not bundle it and fail calls to `to_spark_df()` if PySpark is not installed.
What do you think of these options?
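For the last option, a minimal sketch of the optional-dependency pattern might look like this (the class, error message, and extra name are hypothetical, not PyAirbyte's actual code):

```python
class Dataset:  # hypothetical stand-in for the PyAirbyte dataset class
    """Illustrates the optional-dependency pattern only."""

    def to_pandas(self):
        raise NotImplementedError  # provided by the real dataset class

    def to_spark_df(self, spark_session):
        # Import lazily so PySpark is only required when this method is used.
        try:
            import pyspark  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "PySpark is required for to_spark_df(). Install it with "
                "`pip install pyspark` (or via a hypothetical `airbyte[spark]` extra)."
            ) from exc
        # Simplest possible conversion for now; schema handling could be
        # layered on later.
        return spark_session.createDataFrame(self.to_pandas())
```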
To your other suggestion of a SparkCache implementation, I'm not against it, but I can see it being a significant development effort that could be hard for us to prioritize in the near term.
Now that PyIceberg supports writing datasets, there could be a path where we (long-term) support an `IcebergCache` option, and that would be natively compatible with Spark and a lot of other big data platforms at the same time.
> or just not bundle it and fail calls to `to_spark_df()` if PySpark is not installed
IMO this seems like a good solution, and variants of this pattern are already used in many places (including PyIceberg and pandas).
For my use case, I can't speak to the value of an `IcebergCache`, since we don't use Iceberg in our data lakes. That said, an interesting aspect of caching at the data processing layer (Spark) rather than the data storage layer (Iceberg) is that, in theory, you would automatically get compatibility with everything Spark can write to.
Anyway, `to_spark_df()` is sufficient for me; just some food for thought :)
@jettdc - Thanks for this feedback! I've updated the title of this issue slightly. I think `to_spark()` would be a great addition, so I'm adding the labels `good first issue` and `accepting pull requests`.
The Spark cache implementation is a bit nebulous as of now, but the `to_spark()` approach seems like it could be a big win for data engineers. (We can optionally open another issue for a Spark / data lake cache...)
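For anyone picking this up, here's a rough sketch of the schema-mapping piece that would avoid the manual schema definition described above. It assumes access to the stream's Airbyte JSON schema; the function name and type choices are my own suggestions, not existing PyAirbyte API:

```python
from pyspark.sql.types import (
    BooleanType,
    DoubleType,
    LongType,
    StringType,
    StructField,
    StructType,
)


def spark_schema_from_json_schema(json_schema: dict) -> StructType:
    """Derive a Spark schema from an Airbyte stream's JSON schema.

    Integers map to LongType (matching pandas int64), and objects/arrays are
    kept as JSON strings, per the idiosyncrasies noted earlier in the thread.
    """
    fields = []
    for name, prop in json_schema.get("properties", {}).items():
        json_type = prop.get("type", "string")
        if isinstance(json_type, list):  # e.g. ["null", "integer"]
            json_type = next((t for t in json_type if t != "null"), "string")
        if json_type == "integer":
            spark_type = LongType()
        elif json_type == "number":
            spark_type = DoubleType()
        elif json_type == "boolean":
            spark_type = BooleanType()
        else:  # strings, objects, and arrays all become (JSON) strings
            spark_type = StringType()
        fields.append(StructField(name, spark_type, nullable=True))
    return StructType(fields)
```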