
Add support for SFTP protocol

Open benmayersohn opened this issue 10 months ago • 5 comments

Description

Polars does not currently support the SFTP protocol (see https://github.com/pola-rs/polars/issues/15811). The following snippet raises an error:

import polars as pl
import os

a = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a.write_parquet('example.parquet')
username = os.environ['USER']
currdir = os.getcwd()
url = f'sftp://{username}@localhost/{currdir[1:]}/example.parquet'
df = pl.scan_parquet(url)

Returns

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 394, in scan_parquet
    return _scan_parquet_impl(
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 441, in _scan_parquet_impl
    scan = _scan_parquet_fsspec(source, storage_options)  # type: ignore[arg-type]
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/parquet/anonymous_scan.py", line 21, in _scan_parquet_fsspec
    schema = polars.io.parquet.read_parquet_schema(data)
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 284, in read_parquet_schema
    return _read_parquet_schema(source)
polars.exceptions.ComputeError: parquet: File out of specification: underlying IO error: 'NoneType' object cannot be interpreted as an integer
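For reference, here is a small sketch of how the URL in the snippet is put together and how it breaks down into parts. The helper name `build_sftp_url` is made up for illustration; the user, host, and path values mirror the ones visible in the tracebacks in this issue. The leading slash of the absolute path is dropped so that only a single `/` follows the host.

```python
from urllib.parse import urlsplit


def build_sftp_url(username: str, host: str, abs_path: str) -> str:
    """Build an sftp:// URL from a user, a host, and an absolute path (hypothetical helper)."""
    # Strip the leading "/" so the URL does not end up with "//" after the host.
    return f"sftp://{username}@{host}/{abs_path.lstrip('/')}"


url = build_sftp_url("ben", "localhost", "/home/ben/example.parquet")
parts = urlsplit(url)

# The scheme is what a reader would have to recognise to route the request to SFTP.
print(parts.scheme, parts.username, parts.hostname, parts.path)
```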

It appears this does work as-is with CSV files via read_csv, but not with scan_csv:

import polars as pl
import os

a = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a.write_csv('example.csv')
username = os.environ['USER']
currdir = os.getcwd()
url = f'sftp://{username}@localhost/{currdir[1:]}/example.csv'
df = pl.read_csv(url)
print(df)

yields the expected output

shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

but with scan_csv:

import polars as pl
import os

a = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a.write_csv('example.csv')
username = os.environ['USER']
currdir = os.getcwd()
url = f'sftp://{username}@localhost/{currdir[1:]}/example.csv'
df = pl.scan_csv(url)
print(df)

we get

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/csv/functions.py", line 1103, in scan_csv
    return _scan_csv_impl(
  File "/home/ben/test_venv/lib/python3.10/site-packages/polars/io/csv/functions.py", line 1173, in _scan_csv_impl
    pylf = PyLazyFrame.new_from_csv(
FileNotFoundError: No such file or directory (os error 2): sftp://ben@localhost/home/ben/example.csv
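As a possible stopgap until lazy scanning over SFTP is supported, the file can be fetched eagerly through fsspec and handed to polars as an in-memory buffer. The sketch below is only an illustration under assumptions, not a polars API: `read_remote` and its `opener` parameter are hypothetical names, and it assumes fsspec is installed together with an SFTP backend (which relies on paramiko). Because the whole file is downloaded first, this gives up the benefits of a lazy scan over the remote source.

```python
import io


def read_remote(url, reader, opener=None):
    """Fetch `url` fully into memory, then parse the bytes with `reader`.

    `reader` is any callable that accepts a binary file-like object,
    for example pl.read_parquet or pl.read_csv.
    `opener` defaults to fsspec.open (assumption: fsspec is installed).
    """
    if opener is None:
        import fsspec  # third-party; sftp:// URLs additionally need paramiko

        opener = fsspec.open
    # Download the complete file; nothing is streamed lazily here.
    with opener(url, "rb") as fh:
        buffer = io.BytesIO(fh.read())
    return reader(buffer)


# Hypothetical usage with the URL from the examples above:
#   import polars as pl
#   df = read_remote(url, pl.read_parquet)
#   lf = df.lazy()  # lazy query over data that is already in memory
```

Injecting the opener keeps the helper independent of any one filesystem layer, so the same function can be exercised against a local file with the built-in `open`.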

benmayersohn avatar Apr 21 '24 15:04 benmayersohn

I think we should file this feature request upstream (in object_store), right?

reswqa avatar Apr 22 '24 07:04 reswqa

@reswqa do you mean in arrow-rs, where this label exists: https://github.com/apache/arrow-rs/labels/object-store?

benmayersohn avatar Apr 22 '24 13:04 benmayersohn

@benmayersohn

Yes, it seems the object_store crate lives in the arrow-rs repo.

reswqa avatar Apr 23 '24 03:04 reswqa

Hello, I'm from the OpenDAL community. @reswqa brought this issue to my attention, and I'm here to share some information that could help us make progress on it.

OpenDAL offers a unified data access layer that lets users retrieve data from diverse storage services seamlessly and efficiently. Our goal is to deliver a comprehensive solution for any language, method, integration, and service. It shares some similarities with object_store but has different goals and feature sets.

To add SFTP support, we have the following options:

Send a feature request to object_store

Benefits: less work on our side. Drawbacks: SFTP is outside object_store's scope, so it's unlikely to be implemented.

Native OpenDAL Support

Add native OpenDAL support.

Benefits: more direct service support. Drawbacks: some extra work to make OpenDAL work together with object_store.

Use object_store_opendal

object_store_opendal is an integration, maintained by the OpenDAL community, that lets OpenDAL be used as an ObjectStore.

Benefits: support for more services. Drawbacks: another layer, and less control on our side.


Update on the Python side: Python polars accepts fsspec, which OpenDAL doesn't support yet: https://github.com/apache/opendal/issues/4511

Xuanwo avatar Apr 23 '24 10:04 Xuanwo

Thanks for your comments! I didn't know about OpenDAL or its object store integration, so I appreciate the summary. I had mistakenly assumed that python polars would support SFTP because I do use fsspec, which supports the SFTP filesystem. But all of this is a bit beyond my scope of understanding, so I'm not exactly sure what the best course of action would be from here.

benmayersohn avatar Apr 25 '24 13:04 benmayersohn