Consolidate FileIO

Open kevinjqliu opened this issue 1 year ago • 5 comments

Feature Request / Improvement

Can we consolidate and standardize FileIO to the PyArrow implementation?

There are currently two different FileIO implementations, ARROW_FILE_IO and FSSPEC_FILE_IO. ARROW_FILE_IO uses Apache Arrow's Filesystem Interface while FSSPEC_FILE_IO uses the fsspec library.

Here are a few reasons for consolidating:

  1. PyArrow is already preferred over FsSpec for various FS implementations. https://github.com/apache/iceberg-python/blob/cd7fb502900a717d6b902a398b267eb10e4faa9b/pyiceberg/io/__init__.py#L273-L282

  2. PyIceberg is becoming more coupled with PyArrow: to_arrow() and pa.Table are widely used for reading and writing, including in the new feature #305.

  3. Consolidating removes the need to keep the two FileIOs' behavior in sync. For example, FsSpec defaults a path with no scheme (/tmp/warehouse) to the file scheme, but PyArrow does not. See #301.

  4. The two FileIO implementations are not that different from one another. FsSpec can use its underlying FS implementations, including LocalFileSystem, S3FileSystem, GCSFileSystem, and AzureBlobFileSystem, while PyArrow uses its own FS implementations, including LocalFileSystem, S3FileSystem, HadoopFileSystem, and GcsFileSystem. Of the two lists, only the FsSpec side is missing a HadoopFileSystem implementation, although fsspec itself has support for HDFS.

  5. FsSpec and PyArrow can be used bidirectionally: PyArrow can use an fsspec-based filesystem, and FsSpec can wrap a PyArrow filesystem.
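
To make points 1 and 3 concrete, here is a minimal sketch of scheme-based selection. The table contents, the helper name `candidate_file_ios`, and the choice to treat a scheme-less path as `file` are illustrative assumptions, not a copy of the linked PyIceberg code.

```python
from urllib.parse import urlparse

# Dotted paths of the two implementations (assumed here for illustration).
ARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"
FSSPEC_FILE_IO = "pyiceberg.io.fsspec.FsspecFileIO"

# Illustrative preference table: earlier entries would be tried first.
SCHEME_TO_FILE_IO = {
    "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "file": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "hdfs": [ARROW_FILE_IO],
    "abfs": [FSSPEC_FILE_IO],
}


def candidate_file_ios(location: str) -> list[str]:
    """Return the FileIO candidates for a location, in preference order."""
    # Assumption: a path without a scheme is treated as a local file,
    # so both implementations would see the same scheme.
    scheme = urlparse(location).scheme or "file"
    return SCHEME_TO_FILE_IO.get(scheme, [])
```

With a single lookup like this, /tmp/warehouse and file:///tmp/warehouse resolve to the same candidates, which is the kind of shared default that point 3 asks for.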

kevinjqliu avatar Jan 27 '24 00:01 kevinjqliu

What would be your proposal? The FileIO is an abstraction layer that lets you use different implementations depending on your needs. For example, fsspec is lightweight compared to Arrow and might be preferred if you are inside of a lambda/cloud function or in an orchestration engine like Apache Airflow. As you mentioned, Arrow is more equipped to read tables. In addition, PyIceberg is designed to be used as a library as part of a query engine. If that query engine prefers a different implementation to fetch the data from an object store, the FileIO abstraction layer allows for that.
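
A rough sketch of what that abstraction buys, using made-up names: `FileIO` here is a one-method stand-in for the real interface, and `LocalFileIO`, `InMemoryFileIO`, and `read_text` are hypothetical, not PyIceberg classes.

```python
from pathlib import Path
from typing import Dict, Protocol


class FileIO(Protocol):
    """The only surface the calling code depends on (simplified stand-in)."""

    def read_bytes(self, location: str) -> bytes: ...


class LocalFileIO:
    """Hypothetical backend that reads from the local disk."""

    def read_bytes(self, location: str) -> bytes:
        return Path(location).read_bytes()


class InMemoryFileIO:
    """Hypothetical backend that serves bytes from a dict."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = dict(files)

    def read_bytes(self, location: str) -> bytes:
        return self._files[location]


def read_text(io: FileIO, location: str) -> str:
    # Caller code is written against FileIO, never against a specific backend.
    return io.read_bytes(location).decode("utf-8")
```

A query engine could pass in whichever backend it prefers, and `read_text` would not need to change.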

Fokko avatar Jan 27 '24 20:01 Fokko

I see. I was under the assumption that PyArrow could completely replace fsspec. But it seems like there are a few use cases where we would prefer fsspec.

"fsspec is lightweight compared to Arrow"

Looks like this is right; fsspec is a fraction of the size. https://pypi.org/project/fsspec/#files https://pypi.org/project/pyarrow/#files

Going forward, I think we can address (3) above and refactor fsspec and pyarrow to have the same specs and behaviors. And maybe also address (5) so that we can interchange fsspec and pyarrow easily.
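
One way to approach (3) would be a shared check that runs the same inputs through both implementations and reports where they disagree. The sketch below uses two stand-in functions that only mimic the scheme handling described in (3); they are assumptions for illustration and do not call fsspec or PyArrow.

```python
from urllib.parse import urlparse


def fsspec_style_scheme(location: str) -> str:
    # Stand-in: treat a scheme-less path as the file scheme.
    return urlparse(location).scheme or "file"


def arrow_style_scheme(location: str) -> str:
    # Stand-in: report the scheme exactly as written, possibly empty.
    return urlparse(location).scheme


def find_mismatches(locations, *impls):
    """Map each location to the per-implementation results that disagree."""
    mismatches = {}
    for location in locations:
        results = [impl(location) for impl in impls]
        if len(set(results)) > 1:
            mismatches[location] = results
    return mismatches
```

Running `find_mismatches` over a list of representative locations would flag the scheme-less path and leave fully qualified URIs alone, which gives a concrete list of behaviors to align.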

kevinjqliu avatar Feb 05 '24 17:02 kevinjqliu

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Aug 04 '24 00:08 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Aug 18 '24 00:08 github-actions[bot]

+1. We should use the fsspec API everywhere, while making the underlying implementation configurable.

TiansuYu avatar Sep 02 '24 16:09 TiansuYu

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Mar 02 '25 00:03 github-actions[bot]

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Aug 30 '25 00:08 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Sep 14 '25 00:09 github-actions[bot]