kedro-plugins
`pandas.CSVDataSet` with remote filepaths cannot be pickled
Description
As per title.
cc @jmnunezd
Context
As a result, `pandas.CSVDataSet` with remote filepaths cannot be used with `ParallelRunner`.
Steps to Reproduce
```python
>>> from kedro_datasets.pandas import CSVDataSet
>>> ds = CSVDataSet("https://google.com/data.csv")
>>> from multiprocessing.reduction import ForkingPickler
>>> ForkingPickler.dumps(ds)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/juan_cano/.local/share/rtx/installs/python/3.10.11/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function HTTPFileSystem._exists at 0x124f5f1c0>: it's not the same object as fsspec.implementations.http.HTTPFileSystem._exists
```
Expected Result
Datasets with remote filepaths should behave in the same way as datasets with local filepaths:

```python
>>> ds_ok = CSVDataSet("/tmp/data.csv")
>>> ForkingPickler.dumps(ds_ok)
<memory at 0x124dd13c0>
```
This could be considered a feature request rather than a bug, but it is surprising that the nature of the filepath influences whether the dataset can be pickled.
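One possible general fix would be for datasets to drop the live filesystem handle from their pickled state and recreate it after unpickling, via `__getstate__`/`__setstate__`. The sketch below is purely illustrative and not the kedro implementation: `RemoteDataset` and `_open_filesystem` are hypothetical names, and a closure stands in for the unpicklable `HTTPFileSystem` method from the traceback above.

```python
import pickle


class RemoteDataset:
    """Illustrative stand-in (hypothetical) for a dataset that holds an
    unpicklable filesystem object, like the fsspec HTTPFileSystem above."""

    def __init__(self, filepath):
        self.filepath = filepath
        self._fs = self._open_filesystem()

    def _open_filesystem(self):
        # Stand-in for fsspec.filesystem(...); a closure is unpicklable,
        # mimicking the bound HTTPFileSystem._exists in the traceback.
        return lambda path: f"GET {path}"

    def __getstate__(self):
        # Drop the live filesystem handle before pickling.
        state = self.__dict__.copy()
        state["_fs"] = None
        return state

    def __setstate__(self, state):
        # Restore attributes, then recreate the filesystem lazily
        # (e.g. in a ParallelRunner child process).
        self.__dict__.update(state)
        self._fs = self._open_filesystem()


ds = RemoteDataset("https://google.com/data.csv")
roundtripped = pickle.loads(pickle.dumps(ds))  # succeeds despite the closure
```

Without `__getstate__`, `pickle.dumps(ds)` raises `PicklingError` because the instance `__dict__` contains the closure, which is analogous to the failure reported here.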
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): 0.18.11
- Kedro plugin and kedro plugin version used (`pip show kedro-airflow`): kedro-datasets 1.4.2
- Python version used (`python -V`): 3.10.11
- Operating system and version: macOS Ventura
Looks like this one has the same root cause as https://github.com/kedro-org/kedro/issues/2162, so it makes sense to investigate them together and propose a general fix for all datasets.