polars
Add support for SFTP protocol
Description
The SFTP protocol isn't currently supported by polars, per https://github.com/pola-rs/polars/issues/15811. The following snippet raises an error:
import polars as pl
import os
a = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a.write_parquet('example.parquet')
username = os.environ['USER']
currdir = os.getcwd()
url = f'sftp://{username}@localhost/{currdir[1:]}/example.parquet'
df = pl.scan_parquet(url)
Returns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 394, in scan_parquet
return _scan_parquet_impl(
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 441, in _scan_parquet_impl
scan = _scan_parquet_fsspec(source, storage_options) # type: ignore[arg-type]
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/parquet/anonymous_scan.py", line 21, in _scan_parquet_fsspec
schema = polars.io.parquet.read_parquet_schema(data)
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 284, in read_parquet_schema
return _read_parquet_schema(source)
polars.exceptions.ComputeError: parquet: File out of specification: underlying IO error: 'NoneType' object cannot be interpreted as an integer
It appears this does work as-is with CSV files via read_csv, but not with scan_csv:
import polars as pl
import os
a = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a.write_csv('example.csv')
username = os.environ['USER']
currdir = os.getcwd()
url = f'sftp://{username}@localhost/{currdir[1:]}/example.csv'
df = pl.read_csv(url)
print(df)
yields the expected output
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└─────┴─────┘
but with scan_csv:
import polars as pl
import os
a = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a.write_csv('example.csv')
username = os.environ['USER']
currdir = os.getcwd()
url = f'sftp://{username}@localhost/{currdir[1:]}/example.csv'
df = pl.scan_csv(url)
print(df)
we get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/csv/functions.py", line 1103, in scan_csv
return _scan_csv_impl(
File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/csv/functions.py", line 1173, in _scan_csv_impl
pylf = PyLazyFrame.new_from_csv(
FileNotFoundError: No such file or directory (os error 2): sftp://ben@localhost/home/ben/example.csv
I think we should file this feature request upstream (object_store), right?
@reswqa do you mean in arrow-rs, where this label exists: https://github.com/apache/arrow-rs/labels/object-store?
@benmayersohn Yes, it seems the object_store crate is located in the arrow-rs repo.
Hello, I'm from the OpenDAL community. @reswqa brought this issue to my attention, and I'm here to share some information that could help us make progress on it.
OpenDAL offers a unified data access layer, empowering users to seamlessly and efficiently retrieve data from diverse storage services. Our goal is to deliver a comprehensive solution for any languages, methods, integrations, and services. It shares some similarities with object_store but has different goals and feature sets.
For adding sftp support, we have the following options:
Send a feature request to object_store
Benefits: less work on our side.
Drawbacks: sftp is out of object_store's scope, so it's unlikely to be implemented.
Native OpenDAL support
Adds native opendal support.
Benefits: more direct services support.
Drawbacks: some extra work to make opendal work together with object_store.
Use object_store_opendal
object_store_opendal is an integration maintained by the opendal community that exposes opendal as an ObjectStore.
Benefits: more services support.
Drawbacks: another layer, less control on our side.
Update on the Python side: polars python accepts fsspec, which opendal doesn't support yet: https://github.com/apache/opendal/issues/4511
Thanks for your comments! I didn't know about OpenDAL or its object store integration, so I appreciate the summary. I had mistakenly assumed that python polars would support SFTP because I do use fsspec, which supports the SFTP filesystem. But all of this is a bit beyond my scope of understanding, so I'm not exactly sure what the best course of action would be from here.