
[kedro-datasets] fsspec mixin

Open fdroessler opened this issue 1 year ago • 5 comments

Description and context

I am looking into integrating the relatively new azureml-fsspec bindings to allow loading (and potentially saving) datasets directly from an Azure Machine Learning Studio datastore. This is slightly different from blob storage or adlfs, and azureml-fsspec only recently had its 1.0.0 release. Due to its fsspec implementation, when initialising the file system one needs to pass the URI of the dataset that is to be read. This is not necessary in any other fsspec implementation, and I'm unsure whether it is a feature or a bug; I am still investigating.

However, as part of this I realised that changing anything in the fsspec filesystem initialisation, which is used by the majority of file-based datasets, would mean copying and pasting the changed code into every dataset that uses that functionality. This is error prone and could lead to accidental mistakes further down the line.

One solution I could come up with, though certainly not necessarily the right one, is a FileSystemMixin class. It could provide the fsspec filesystem capability to all datasets that need it while allowing modifications in one place. A function that returns the file system would work as well, but I have always liked the mixin pattern for some reason. Either way, this would need further work in each dataset's __init__ method that I have not fully thought through.

Happy to contribute a PR if this is of interest.

Possible Implementation

A minimal working example that I used to try out azureml-fsspec. The `if self._protocol == "azureml": ...` branch can be ignored in general, but would be necessary to support azureml-fsspec in its current implementation.

from copy import deepcopy

import fsspec


class FileSystemMixin:
    def __init__(self, protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
        _fs_args = deepcopy(fs_args) or {}
        _credentials = deepcopy(credentials) or {}

        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)

        self._protocol = protocol
        self._storage_options = {**_credentials, **_fs_args}
        if self._protocol == "azureml":
            # azureml-fsspec needs the dataset URI at filesystem construction time
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options, uri=filepath)
        else:
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

    def exists(self, path):
        return self._fs.exists(path)

    def glob(self, path):
        return self._fs.glob(path)

A modified pandas CSVDataSet, omitting docstrings for brevity.

from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict
import logging

import pandas as pd
from kedro.io.core import AbstractVersionedDataSet, Version, get_protocol_and_path

logger = logging.getLogger(__name__)


class CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame], FileSystemMixin):

    DEFAULT_LOAD_ARGS = {}  # type: Dict[str, Any]
    DEFAULT_SAVE_ARGS = {"index": False}  # type: Dict[str, Any]

    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
    ) -> None:
        protocol, path = get_protocol_and_path(filepath, version)
        FileSystemMixin.__init__(
            self,
            protocol,
            filepath,
            credentials,
            fs_args
        )
        AbstractVersionedDataSet.__init__(
            self,
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self.exists,
            glob_function=self.glob,
        )

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        if "storage_options" in self._save_args or "storage_options" in self._load_args:
            logger.warning(
                "Dropping 'storage_options' for %s, "
                "please specify them under 'fs_args' or 'credentials'.",
                self._filepath,
            )
            self._save_args.pop("storage_options", None)
            self._load_args.pop("storage_options", None)

Possible Alternatives

I haven't tried it, but I suspect something like the following could work too, although it would need some additional assignments in each dataset's __init__ (such as self._protocol) for use in load and save.

from copy import deepcopy

import fsspec


def create_fs(protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
    fs_args = deepcopy(fs_args) or {}
    credentials = deepcopy(credentials) or {}

    if protocol == "file":
        fs_args.setdefault("auto_mkdir", True)

    if protocol == "azureml":
        return fsspec.filesystem(protocol, **credentials, **fs_args, uri=filepath)
    return fsspec.filesystem(protocol, **credentials, **fs_args)
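As a sanity check, here is a runnable sketch of how a dataset's __init__ might consume such a helper. The helper is repeated so the snippet is standalone, and `SomeDataSet` is purely illustrative, not Kedro API:

```python
from copy import deepcopy

import fsspec


def create_fs(protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
    # repeated from above so this snippet runs standalone
    fs_args = deepcopy(fs_args) or {}
    credentials = deepcopy(credentials) or {}
    if protocol == "file":
        fs_args.setdefault("auto_mkdir", True)
    if protocol == "azureml":
        return fsspec.filesystem(protocol, **credentials, **fs_args, uri=filepath)
    return fsspec.filesystem(protocol, **credentials, **fs_args)


class SomeDataSet:
    """Illustrative only: shows the attributes a dataset would still need."""

    def __init__(self, filepath: str, credentials: dict = None, fs_args: dict = None):
        protocol = "file"  # in Kedro this would come from get_protocol_and_path(filepath)
        # the dataset keeps these for use in load/save and versioning
        self._protocol = protocol
        self._fs = create_fs(protocol, filepath, credentials, fs_args)

    def exists(self, path):
        return self._fs.exists(path)
```
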

fdroessler avatar May 07 '23 15:05 fdroessler

Hi @fdroessler, thanks a lot for your interest in Kedro!

Sharing some thoughts here: personally, I'm not a big fan of mixins and multiple inheritance. If we went for a composition-over-inheritance approach, maybe we could have something like an FSSpecAdapter (bad name, bikesheddable) so that exists_function and glob_function are not part of the parent class but of a new .adapter property.
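For illustration, a minimal sketch of what that composition could look like. `FSSpecAdapter`, `AdapterBackedDataSet`, and their wiring are assumptions drawn from this comment, not existing Kedro API:

```python
from copy import deepcopy

import fsspec


class FSSpecAdapter:
    """Owns the fsspec filesystem; datasets would hold one via composition."""

    def __init__(self, protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
        _fs_args = deepcopy(fs_args) or {}
        _credentials = deepcopy(credentials) or {}
        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)
        self.protocol = protocol
        if protocol == "azureml":
            # azureml-fsspec reportedly needs the dataset URI at construction time
            self._fs = fsspec.filesystem(protocol, **_credentials, **_fs_args, uri=filepath)
        else:
            self._fs = fsspec.filesystem(protocol, **_credentials, **_fs_args)

    # exposed so a versioned dataset could pass them as exists/glob functions
    def exists(self, path):
        return self._fs.exists(path)

    def glob(self, path):
        return self._fs.glob(path)


class AdapterBackedDataSet:
    """Illustrative only: the dataset composes the adapter instead of mixing it in."""

    def __init__(self, filepath: str):
        self.adapter = FSSpecAdapter("file", filepath)
```
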

Beware that the engineering team might have a different opinion on this mixin. More importantly, it's up for debate whether this relatively small refactor is worth taking on in the face of https://github.com/kedro-org/kedro/issues/1778

Waiting for other colleagues to chime in.

astrojuanlu avatar May 08 '23 16:05 astrojuanlu

Thanks @astrojuanlu, I was not aware of this major refactor. I guess it is still some way off, or do you have more details on the timeline? If there is no point in refactoring, whether through multiple inheritance or composition, would a PR to allow (at least read) functionality for the azureml fsspec be considered?

fdroessler avatar May 14 '23 17:05 fdroessler

@fdroessler https://github.com/kedro-org/kedro/milestone/12

Can you see this? I believe we opened access to everyone, but let me know if that is not the case.

We are aware of the datasets problem; they are not very flexible. As you can see, fsspec wasn't part of the abstract class contract, but we use it almost everywhere.

Can you show an example of why you need to refactor all datasets? In the past we have added new protocols, and it's usually quite straightforward to do so. Let's see if we can do something about it, as this refactoring is not likely to happen very soon.

noklam avatar May 15 '23 09:05 noklam

@noklam Yes, I can see that, thanks for pointing it out. The problem with azureml-fsspec is that it is not implemented like adlfs or adl; instead, you have to provide the URI of the storage when instantiating the file system. This might be a bug, but I'm not sure, to be honest.

This leads to the following modifications on my side:

Also note that azureml-fsspec only supports reading at this point; there is an option to "upload", but that has to go through a separate method on the filesystem instance.

add azureml to cloud protocols: CLOUD_PROTOCOLS = ("s3", "s3n", "s3a", "gcs", "gs", "adl", "abfs", "abfss", "gdrive", "azureml")

and then:

        if self._protocol == "azureml":
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options, uri=filepath)
        else:
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

This is what I would have to add to all the datasets to support azureml access. It got me thinking about how to simplify modifications like this and avoid code redundancy. However, when I first wrote this I was not aware of the major datasets refactor.

Nonetheless, it would be great to have something added to support azureml-fsspec if possible. It does not have to be multiple inheritance or composition at this stage; I'm happy to add the code snippet to all the relevant datasets if necessary.

It might also be that the read-only restriction is too limiting for Kedro 🤷, but I am also working on adding functionality to the kedro-azureml plugin to better support remote and local execution.

fdroessler avatar May 19 '23 08:05 fdroessler

@fdroessler Without having thought about it more, I generally like the idea of splitting some of these functionalities into mix-ins, as you propose. There's some precedent for that (see https://github.com/kedro-org/kedro/pull/15), even though the DefaultArgumentsMixIn specifically seems to have disappeared in 0.16.0.

I haven't considered the adapter proposal from @astrojuanlu at all; I need to brush up on design patterns again to give good insight into mix-in vs. adapter vs. something else. 😅

deepyaman avatar Jun 07 '23 13:06 deepyaman