kedro-plugins
`pandas.CSVDataSet` with remote filepaths cannot be pickled
Description
As per title.
cc @jmnunezd
Context
As a result, `pandas.CSVDataSet` with remote filepaths cannot be used with `ParallelRunner`.
Steps to Reproduce
```python
>>> from kedro_datasets.pandas import CSVDataSet
>>> ds = CSVDataSet("https://google.com/data.csv")
>>> from multiprocessing.reduction import ForkingPickler
>>> ForkingPickler.dumps(ds)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/juan_cano/.local/share/rtx/installs/python/3.10.11/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function HTTPFileSystem._exists at 0x124f5f1c0>: it's not the same object as fsspec.implementations.http.HTTPFileSystem._exists
```
Expected Result
Datasets with remote filepaths should behave in the same way as datasets with local filepaths:

```python
>>> ds_ok = CSVDataSet("/tmp/data.csv")
>>> ForkingPickler.dumps(ds_ok)
<memory at 0x124dd13c0>
```
This could be considered a feature request rather than a bug, but it is surprising that the nature of the filepath influences whether the dataset can be pickled.
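One possible general fix would be for datasets to drop the live filesystem handle from their pickled state and recreate it after unpickling, via `__getstate__`/`__setstate__`. The sketch below is purely illustrative and not the kedro implementation: `RemoteDataset` and `_open_filesystem` are hypothetical names, and a closure stands in for the unpicklable `HTTPFileSystem` method from the traceback above.

```python
import pickle


class RemoteDataset:
    """Illustrative stand-in (hypothetical) for a dataset that holds an
    unpicklable filesystem object, like the fsspec HTTPFileSystem above."""

    def __init__(self, filepath):
        self.filepath = filepath
        self._fs = self._open_filesystem()

    def _open_filesystem(self):
        # Stand-in for fsspec.filesystem(...); a closure is unpicklable,
        # mimicking the bound HTTPFileSystem._exists in the traceback.
        return lambda path: f"GET {path}"

    def __getstate__(self):
        # Drop the live filesystem handle before pickling.
        state = self.__dict__.copy()
        state["_fs"] = None
        return state

    def __setstate__(self, state):
        # Restore attributes, then recreate the filesystem lazily
        # (e.g. in a ParallelRunner child process).
        self.__dict__.update(state)
        self._fs = self._open_filesystem()


ds = RemoteDataset("https://google.com/data.csv")
roundtripped = pickle.loads(pickle.dumps(ds))  # succeeds despite the closure
```

Without `__getstate__`, `pickle.dumps(ds)` raises `PicklingError` because the instance `__dict__` contains the closure, which is analogous to the failure reported here.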
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): 0.18.11
- Kedro plugin and kedro plugin version used (`pip show kedro-airflow`): kedro-datasets 1.4.2
- Python version used (`python -V`): 3.10.11
- Operating system and version: macOS Ventura
Looks like this one has the same root cause as https://github.com/kedro-org/kedro/issues/2162, so it makes sense to investigate them together and propose a general fix for all datasets.