urlopen error when using an SFTP path

Open TristanFauvel opened this issue 2 years ago • 3 comments

Description

When adding my_data to the DataCatalog with an SFTP path:

my_data:
  type: pandas.CSVDataSet
  filepath: "sftp://<host>/<path>/<filename>.csv"
  credentials: my_credentials

I get:

URLError: <urlopen error unknown url type: sftp>

Context

I am trying to load a .csv file from a server using SFTP. Creating the following custom SFTPDataSet class, which opens the file through the dataset's fsspec filesystem before handing it to pandas, solved the issue:

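# Imports assumed for kedro 0.18.x with kedro-datasets 1.x; adjust to your installation.
from typing import Any, Dict

import pandas as pd
from kedro.io.core import PROTOCOL_DELIMITER, Version
from kedro_datasets.pandas import CSVDataSet


class SFTPDataSet(CSVDataSet):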
    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
        metadata: Dict[str, Any] = None,
    ) -> None:
        super().__init__(
            filepath, load_args, save_args, version, credentials, fs_args, metadata
        )

    def _load(self) -> pd.DataFrame:
        load_path = str(self._get_load_path())
        if self._protocol == "file":
            return pd.read_csv(load_path, **self._load_args)

        load_path = f"{self._protocol}{PROTOCOL_DELIMITER}{load_path}"

        sftp = self._fs

        with sftp.open(load_path) as f:
            data = pd.read_csv(f, **self._load_args)

        return data

Steps to Reproduce

  1. Add a dataset to the DataCatalog with an SFTP path, and add the credentials in conf/local
  2. Create a node that loads the data in a pipeline

Expected Result

The .csv should be loaded into a pandas dataframe.

Actual Result

Instead I get:

URLError: <urlopen error unknown url type: sftp>
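For illustration, the same message can be reproduced with the Python standard library alone. This is a minimal sketch, not a trace of what Kedro or pandas actually does: it only shows that urllib.parse recognises the sftp scheme while urllib.request has no handler to open it, which would be consistent with the path being handed to urlopen instead of fsspec. The host and path below are hypothetical placeholders.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse, uses_netloc

# Hypothetical placeholder URL; no network connection is attempted because
# urllib rejects the scheme before it tries to resolve the host.
url = "sftp://example.invalid/data/my_data.csv"

# "sftp" is a scheme that urllib.parse knows how to split into host and path...
scheme = urlparse(url).scheme
scheme_is_known_to_parser = scheme in uses_netloc

# ...but urllib.request ships no handler that can actually open it.
try:
    urllib.request.urlopen(url)
    message = None
except urllib.error.URLError as exc:
    message = str(exc)

print(scheme_is_known_to_parser)  # True
print(message)  # <urlopen error unknown url type: sftp>
```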

Your Environment

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.13
  • Python version used (python -V): Python 3.10.12
  • Operating system and version: Windows 10 Pro

TristanFauvel, Sep 01 '23 08:09

Hi @TristanFauvel, Kedro already supports SFTP in all datasets implemented with fsspec (which uses paramiko underneath). See the example here:

https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-a-csv-file-stored-in-a-remote-location-through-ssh

datajoely, Sep 01 '23 09:09

Hi @datajoely, thanks for the quick reply. I did follow the example you linked, so this is a bug report, not a feature request.

I noticed that the bug occurs in CSVDataSet's _load(). Replacing:

pd.read_csv(load_path, storage_options=self._storage_options, **self._load_args)

with:

with self._fs.open(load_path) as f:
    data = pd.read_csv(f, **self._load_args)

solved the bug (as I did in the SFTPDataSet class above).
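The workaround pattern can be tried without an SFTP server. The sketch below is an assumption-laden illustration, not Kedro's code: it uses fsspec's in-memory filesystem as a stand-in for the SFTP filesystem, and load_csv_via_fs is a hypothetical helper name. The point is that pandas receives an already-open file object, so it never has to interpret the URL scheme itself.

```python
import fsspec
import pandas as pd


def load_csv_via_fs(fs, path, **load_args):
    """Open `path` through the given fsspec filesystem and parse it with pandas.

    Hypothetical helper mirroring the workaround in this thread: pandas is
    given a file object rather than a URL string.
    """
    with fs.open(path, mode="rb") as f:
        return pd.read_csv(f, **load_args)


# Stand-in for the remote filesystem; with SFTP this would be the dataset's
# own filesystem object (self._fs in the class above).
fs = fsspec.filesystem("memory")

# Write a small CSV so there is something to load.
with fs.open("/demo/data.csv", mode="wb") as f:
    f.write(b"a,b\n1,2\n3,4\n")

df = load_csv_via_fs(fs, "/demo/data.csv")
print(df)
```

Any keyword arguments passed to the helper are forwarded to pd.read_csv, in the same way that the dataset forwards its load_args.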

pandas version: I got the bug with both 2.0.3 and 2.1.0.

TristanFauvel, Sep 01 '23 09:09

Hi @TristanFauvel, do you need more help with this issue or can it be closed?

merelcht, Jan 12 '24 14:01