kedro-plugins
urlopen error when using an SFTP path
Description
When adding my_data to the DataCatalog with an SFTP path:
my_data:
  type: pandas.CSVDataSet
  filepath: "sftp://<host>/<path>/<filename>.csv"
  credentials: my_credentials
I get:
URLError: <urlopen error unknown url type: sftp>
Context
I am trying to load a .csv file from a server using SFTP. Creating the following custom SFTPDataSet class solved the issue:
from typing import Any, Dict

import pandas as pd
from kedro.io.core import PROTOCOL_DELIMITER, Version
# Assumes CSVDataSet comes from the kedro-datasets package; adjust if it is imported from elsewhere.
from kedro_datasets.pandas import CSVDataSet


class SFTPDataSet(CSVDataSet):
    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
        metadata: Dict[str, Any] = None,
    ) -> None:
        super().__init__(
            filepath, load_args, save_args, version, credentials, fs_args, metadata
        )

    def _load(self) -> pd.DataFrame:
        load_path = str(self._get_load_path())
        if self._protocol == "file":
            return pd.read_csv(load_path, **self._load_args)

        load_path = f"{self._protocol}{PROTOCOL_DELIMITER}{load_path}"
        # Open the remote file through the fsspec filesystem and hand pandas
        # the file object instead of the sftp:// URL string.
        sftp = self._fs
        with sftp.open(load_path) as f:
            data = pd.read_csv(f, **self._load_args)
        return data
Steps to Reproduce
- Add a dataset to the DataCatalog with an SFTP path, and add the credentials in conf/local
- Create a node that loads the data in a pipeline
Expected Result
The .csv should be loaded into a pandas dataframe.
Actual Result
Instead I get:
URLError: <urlopen error unknown url type: sftp>
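The error text matches what Python's standard urllib raises for a URL scheme it has no handler for. The sketch below uses only the standard library and reproduces the same message without any network access. The host example.invalid is a hypothetical placeholder. That a URL-sniffing reader would hand an sftp:// string to urlopen because sftp is listed in urllib.parse.uses_netloc is an assumption, not something confirmed in this thread.

```python
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical placeholder URL; no connection is attempted because urllib
# has no handler registered for the sftp scheme.
url = "sftp://example.invalid/data.csv"

# urllib.parse lists sftp among the schemes that carry a network location.
scheme_is_listed = "sftp" in urllib.parse.uses_netloc

try:
    urllib.request.urlopen(url)
    message = None
except urllib.error.URLError as exc:
    # str() of a URLError has the form "<urlopen error REASON>".
    message = str(exc)

print(scheme_is_listed)
print(message)
```

If the printed message is the same as the one above, the failure probably happens before any fsspec or paramiko code is reached.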
Your Environment
- Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.13
- Python version used (python -V): Python 3.10.12
- Operating system and version: Windows 10 Pro
Hi @TristanFauvel, Kedro already supports SFTP through all the datasets implemented with fsspec (with paramiko underneath). See the example here:
https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-a-csv-file-stored-in-a-remote-location-through-ssh
Hi @datajoely, thanks for the quick reply. I did follow the example you linked, so this is a bug report, not a feature request.
I noticed that the bug occurs in CSVDataSet's _load(). Replacing:
pd.read_csv(load_path, storage_options=self._storage_options, **self._load_args)
with:
with self._fs.open(load_path) as f:
    data = pd.read_csv(f, **self._load_args)
solved the bug (as I did in the SFTPDataSet class above).
pandas version: I got the bug with both 2.0.3 and 2.1.0.
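For anyone who wants to try the pattern outside Kedro, here is a minimal sketch of the same idea: open the file through an fsspec filesystem object and pass the handle, not the URL string, to pandas. It assumes pandas and fsspec are installed. The helper name read_csv_via_fs and the path /demo/data.csv are hypothetical. An in-memory filesystem stands in for the SFTP server so the sketch runs anywhere.

```python
import fsspec
import pandas as pd


def read_csv_via_fs(fs, path, **load_args):
    """Open `path` through an fsspec filesystem and parse the handle with pandas."""
    with fs.open(path, mode="rb") as handle:
        return pd.read_csv(handle, **load_args)


# In-memory filesystem used only so the sketch runs without a server.
# With a real server this would be an SFTP filesystem created by fsspec
# (which needs paramiko), configured with your own host and credentials.
fs = fsspec.filesystem("memory")
fs.pipe_file("/demo/data.csv", b"a,b\n1,2\n3,4\n")

df = read_csv_via_fs(fs, "/demo/data.csv")
print(df)
```

Because pandas only ever sees a file object here, it does not need to interpret the URL scheme at all.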
Hi @TristanFauvel, do you need more help with this issue or can it be closed?