Consolidate FileIO
Feature Request / Improvement
Can we consolidate and standardize FileIO to the PyArrow implementation?
There are currently two different FileIO implementations, ARROW_FILE_IO and FSSPEC_FILE_IO. ARROW_FILE_IO uses Apache Arrow's Filesystem Interface while FSSPEC_FILE_IO uses the fsspec library.
Here are a few reasons for consolidating:
1. PyArrow is already preferred over FsSpec for various FS implementations. https://github.com/apache/iceberg-python/blob/cd7fb502900a717d6b902a398b267eb10e4faa9b/pyiceberg/io/__init__.py#L273-L282
2. PyIceberg is becoming more coupled with PyArrow: to_arrow() and pa.Table are widely used for reading and writing, including the new feature #305.
3. Easier to keep the two FileIO implementations' behavior in sync. For example, FsSpec defaults a path with no scheme (/tmp/warehouse) to the file scheme, but PyArrow does not. See #301.
4. The two FileIO implementations are not that different from one another. FsSpec can use its underlying FS implementations, including LocalFileSystem, S3FileSystem, GCSFileSystem, and AzureBlobFileSystem, while PyArrow uses its own FS implementations, including LocalFileSystem, S3FileSystem, HadoopFileSystem, and GcsFileSystem. PyArrow is currently missing the HadoopFileSystem implementation but it has support for HDFS.
5. Fsspec and PyArrow can be used bidirectionally: PyArrow can use an fsspec-based filesystem, and FsSpec can wrap a PyArrow filesystem.
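To make the first point concrete, here is a rough sketch of what a per-scheme preference lookup can look like. The table contents and the names PREFERRED_BACKENDS, module_installed, and pick_backend are made up for illustration and are not copied from PyIceberg; the linked lines show the real mapping. Treating a schemeless path as "file" is also an assumption of this sketch.

```python
from importlib.util import find_spec
from typing import Callable, Dict, List, Optional
from urllib.parse import urlparse

# Hypothetical preference table: candidate backends per URI scheme, in
# priority order. The entries are illustrative only.
PREFERRED_BACKENDS: Dict[str, List[str]] = {
    "s3": ["pyarrow", "fsspec"],
    "gs": ["pyarrow", "fsspec"],
    "file": ["pyarrow", "fsspec"],
    "hdfs": ["pyarrow"],
    "abfs": ["fsspec"],
}


def module_installed(name: str) -> bool:
    """Return True if a top-level module with this name can be imported."""
    return find_spec(name) is not None


def pick_backend(
    location: str,
    is_available: Callable[[str], bool] = module_installed,
) -> Optional[str]:
    """Return the first preferred backend that is available for the location.

    Assumption: a location without a scheme is treated as a local file.
    Returns None when the scheme is unknown or no candidate is available.
    """
    scheme = urlparse(location).scheme or "file"
    for backend in PREFERRED_BACKENDS.get(scheme, []):
        if is_available(backend):
            return backend
    return None
```

With a table like this, PyArrow wins whenever it is installed, and fsspec is only chosen as a fallback or for schemes where it is the sole candidate.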
What would be your proposal? The FileIO is an abstraction layer to use different implementations for your needs. For example, fsspec is lightweight compared to Arrow and might be preferred if you are inside of a lambda/cloud function or in an orchestration engine like Apache Airflow. As you mentioned, Arrow is more equipped to read tables. Next to that, PyIceberg is designed to be used as a library as part of a query engine. If that query engine prefers a different implementation to fetch the data from an object store, the FileIO abstraction layer allows for that.
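As a rough illustration of why the abstraction layer matters, here is a toy version of the idea. SimpleFileIO, InMemoryFileIO, LocalFileIO, and copy_file are hypothetical names for this sketch; PyIceberg's real FileIO interface is richer and differs in its details.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict


class SimpleFileIO(ABC):
    """Hypothetical minimal interface that callers program against."""

    @abstractmethod
    def write_bytes(self, location: str, data: bytes) -> None:
        ...

    @abstractmethod
    def read_bytes(self, location: str) -> bytes:
        ...


class InMemoryFileIO(SimpleFileIO):
    """Lightweight backend that keeps everything in a dict."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def write_bytes(self, location: str, data: bytes) -> None:
        self._files[location] = data

    def read_bytes(self, location: str) -> bytes:
        return self._files[location]


class LocalFileIO(SimpleFileIO):
    """Backend that reads and writes the local disk."""

    def write_bytes(self, location: str, data: bytes) -> None:
        parent = os.path.dirname(location)
        if parent:
            os.makedirs(parent, exist_ok=True)
        with open(location, "wb") as f:
            f.write(data)

    def read_bytes(self, location: str) -> bytes:
        with open(location, "rb") as f:
            return f.read()


def copy_file(io: SimpleFileIO, src: str, dst: str) -> None:
    """Calling code depends only on the interface, not on the backend."""
    io.write_bytes(dst, io.read_bytes(src))
```

copy_file works unchanged with either backend, which is the property a query engine relies on when it plugs in its own implementation.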
I see. I was under the assumption that PyArrow could completely replace fsspec. But it seems like there are a few use cases where we would prefer fsspec.
fsspec is lightweight compared to Arrow
Looks like this is right; fsspec is a fraction of the size. https://pypi.org/project/fsspec/#files https://pypi.org/project/pyarrow/#files
Going forward, I think we can address (3) above and refactor fsspec and pyarrow to have the same specs and behaviors. And maybe also address (5) so that we can interchange fsspec and pyarrow easily.
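One way to approach (3) would be a single shared parsing step that both implementations call before touching a path. The following is only a sketch under that assumption; Location and parse_location are hypothetical names here, and defaulting to "file" simply mirrors the FsSpec behavior described in the issue.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class Location(NamedTuple):
    scheme: str
    netloc: str
    path: str


def parse_location(location: str, default_scheme: str = "file") -> Location:
    """Split a location into scheme, netloc, and path.

    A location with no scheme (for example /tmp/warehouse) gets the
    default scheme, so every backend sees the same result.
    """
    uri = urlparse(location)
    return Location(uri.scheme or default_scheme, uri.netloc, uri.path)


local = parse_location("/tmp/warehouse")
remote = parse_location("s3://bucket/warehouse/db")
```

If both FileIO implementations resolved paths through one helper like this, a schemeless path could not be interpreted differently depending on which backend is loaded.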
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
+1. We should use the fsspec API everywhere, while keeping the choice of underlying implementation configurable.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'