[kedro-datasets] fsspec mixin
Description and context
I am looking into integrating the relatively new azureml-fsspec bindings to allow loading (and potentially saving) datasets directly from an Azure Machine Learning Studio datastore. This is slightly different from blob storage or adlfs, and it only recently got its 1.0.0 release. Due to their implementation of fsspec, one needs to pass the uri of the dataset that is to be read when initialising the file system. This is not necessary in any other fsspec implementation, and I'm unsure whether it is a feature or a bug; I am still investigating.
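To illustrate the difference (the credentials and URIs below are placeholders, and this reflects the behaviour I am currently seeing):

import fsspec

credentials = {}  # placeholder

# Most fsspec implementations: the filesystem is created from the protocol
# alone, and paths are only passed to the individual read/write calls.
fs = fsspec.filesystem("abfs", **credentials)

# azureml-fsspec: the uri of the dataset to be read already has to be
# supplied when the filesystem is created (placeholder URI).
fs = fsspec.filesystem("azureml", uri="azureml://<datastore-uri>/data.csv", **credentials)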
However, as part of this I realised that changing anything in the fsspec filesystem initialisation, which is used in the majority of file-based datasets, would require copying and pasting the changed code into every dataset that uses that functionality. This is error-prone and could lead to accidental mistakes further down the line.
One solution I could come up with, and it is certainly not necessarily the right one, is a FileSystemMixin class. It could provide the fsspec filesystem capability to all datasets that need it, while allowing modifications in a single place if necessary. One could use a function that returns the file system instead, but I have always liked the mixin pattern for some reason. Either way, this would need further work in each dataset's __init__ function that I have not fully thought through.
Happy to contribute a PR if this is of interest.
Possible Implementation
A minimal working example that I used to try out azureml-fsspec. The if self._protocol == "azureml": ... branch can be ignored for other protocols, but it would be necessary to support azureml-fsspec in its current implementation.
import fsspec
from copy import deepcopy


class FileSystemMixin:
    """Provides an fsspec filesystem plus the exists/glob helpers datasets need."""

    def __init__(self, protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
        _fs_args = deepcopy(fs_args) or {}
        _credentials = deepcopy(credentials) or {}
        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)
        self._protocol = protocol
        self._storage_options = {**_credentials, **_fs_args}
        if self._protocol == "azureml":
            # azureml-fsspec currently needs the dataset URI at initialisation time
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options, uri=filepath)
        else:
            self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

    def exists(self, path):
        return self._fs.exists(path)

    def glob(self, path):
        return self._fs.glob(path)
Modified pandas.CSVDataSet, with docstrings omitted for brevity:
import logging
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

import pandas as pd
from kedro.io.core import AbstractVersionedDataSet, Version, get_protocol_and_path

logger = logging.getLogger(__name__)


class CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame], FileSystemMixin):
    DEFAULT_LOAD_ARGS = {}  # type: Dict[str, Any]
    DEFAULT_SAVE_ARGS = {"index": False}  # type: Dict[str, Any]

    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
    ) -> None:
        protocol, path = get_protocol_and_path(filepath, version)
        # The filesystem set-up now lives in the mixin instead of being inlined here
        FileSystemMixin.__init__(self, protocol, filepath, credentials, fs_args)
        AbstractVersionedDataSet.__init__(
            self,
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self.exists,
            glob_function=self.glob,
        )

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        if "storage_options" in self._save_args or "storage_options" in self._load_args:
            logger.warning(
                "Dropping 'storage_options' for %s, "
                "please specify them under 'fs_args' or 'credentials'.",
                self._filepath,
            )
            self._save_args.pop("storage_options", None)
            self._load_args.pop("storage_options", None)
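For illustration, usage would roughly look like this (the azureml URI is a made-up placeholder, and _load, _save, etc. are unchanged from the existing dataset and therefore not shown):

# Existing protocols keep working exactly as before:
data_set = CSVDataSet(filepath="s3://my-bucket/data/cars.csv")

# An azureml URI is routed through the azureml branch of the mixin
# (placeholder URI, not a real datastore path):
data_set = CSVDataSet(filepath="azureml://<datastore-uri>/cars.csv")
df = data_set.load()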
Possible Alternatives
I have not tried it, but I guess something like this could work too, although it would need some additional definitions in the dataset's __init__, such as self._protocol, for use in load, save, etc. (see the sketch after the function below).
def create_fs(protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
    fs_args = deepcopy(fs_args) or {}
    credentials = deepcopy(credentials) or {}
    if protocol == "file":
        fs_args.setdefault("auto_mkdir", True)
    if protocol == "azureml":
        # Same special case as in the mixin: azureml-fsspec needs the URI up front
        fs = fsspec.filesystem(protocol, **credentials, **fs_args, uri=filepath)
    else:
        fs = fsspec.filesystem(protocol, **credentials, **fs_args)
    return fs
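As a rough sketch of those additional definitions (names are illustrative, and this assumes the same imports as above), a dataset's __init__ might then look like:

def __init__(
    self,
    filepath: str,
    version: Version = None,
    credentials: Dict[str, Any] = None,
    fs_args: Dict[str, Any] = None,
) -> None:
    protocol, path = get_protocol_and_path(filepath, version)
    # These are the attributes the mixin would otherwise provide; _load/_save
    # still rely on self._protocol and self._fs, so they must be set here.
    self._protocol = protocol
    self._fs = create_fs(protocol, filepath, credentials, fs_args)
    AbstractVersionedDataSet.__init__(
        self,
        filepath=PurePosixPath(path),
        version=version,
        exists_function=self._fs.exists,
        glob_function=self._fs.glob,
    )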
Hi @fdroessler, thanks a lot for your interest in Kedro!
Sharing some thoughts here: personally, I'm not a big fan of mixins and multiple inheritance. If we went for a composition-over-inheritance approach, maybe we could have something like an FSSpecAdapter (bad name, bikesheddable), so that exists_function and glob_function are not part of the parent class but of a new .adapter property.
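Very roughly, and with every name here made up for illustration (assuming the same imports as the snippets above), something like:

class FSSpecAdapter:
    """Owns the fsspec filesystem on behalf of a dataset (composition, not inheritance)."""

    def __init__(self, protocol: str, filepath: str, credentials: dict = None, fs_args: dict = None):
        storage_options = {**deepcopy(credentials or {}), **deepcopy(fs_args or {})}
        if protocol == "azureml":
            self._fs = fsspec.filesystem(protocol, **storage_options, uri=filepath)
        else:
            self._fs = fsspec.filesystem(protocol, **storage_options)

    def exists(self, path):
        return self._fs.exists(path)

    def glob(self, path):
        return self._fs.glob(path)


# In a dataset's __init__, instead of inheriting from a mixin:
# self.adapter = FSSpecAdapter(protocol, filepath, credentials, fs_args)
# ...then pass self.adapter.exists / self.adapter.glob to the parent class.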
Beware that the engineering team might have a different opinion regarding this mixin. More importantly, it's up for debate whether this relatively small refactor is worth taking on in the face of https://github.com/kedro-org/kedro/issues/1778
Waiting for other colleagues to chime in.
Thanks @astrojuanlu, I was not aware of this major refactor. I guess it is some way off though, or do you have more details on the timeline? In case there is no point in refactoring, whether through multiple inheritance or composition, would a PR to allow (at least read) functionality for the azureml fsspec implementation be considered?
@fdroessler https://github.com/kedro-org/kedro/milestone/12
Can you see this? I believe we opened access to everyone, but let me know if that is not the case.
We are aware of the datasets problem; the design is not very flexible. As you can see, fsspec was never part of the abstract class contract, yet we use it almost everywhere.
Can you show an example of why you would need to refactor all the datasets? In the past we have added new protocols, and it is usually quite straightforward to do so. Let's see if we can do something about it, as this refactoring is not likely to happen very soon.
@noklam Yes I can see that, thanks for pointing it out. The problem with azureml-fsspec is that it is not implemented like adlfs or adl: you have to provide the URI of the storage when instantiating the filesystem. Might be a bug, but I'm not sure, to be honest.
This leads to the following modifications on my side. Also note that azureml-fsspec only supports reading at this point; there is an option to "upload", but that has to go through a separate method on the filesystem instance.
First, add azureml to the cloud protocols:
CLOUD_PROTOCOLS = ("s3", "s3n", "s3a", "gcs", "gs", "adl", "abfs", "abfss", "gdrive", "azureml")
and then special-case the filesystem initialisation:
if self._protocol == "azureml":
    # azureml-fsspec needs the target URI when the filesystem is created
    self._fs = fsspec.filesystem(self._protocol, **self._storage_options, uri=filepath)
else:
    self._fs = fsspec.filesystem(self._protocol, **self._storage_options)
This is what I would have to add to every dataset in order to support access via azureml. It got me thinking about how to simplify modifications like this and avoid code redundancy; however, when I first wrote this I was not aware of the major refactor of datasets.
Nonetheless, it would be great to have something added to support azureml-fsspec if possible. It does not have to be multiple inheritance or composition at this stage; I'm happy to add the code snippet to all the relevant datasets if necessary.
It might also be that the read-only restriction is too limiting for Kedro 🤷 However, I am also working on adding some functionality to the kedro-azureml plugin to better support remote and local execution.
@fdroessler Without thinking about it more, I generally like the idea of splitting some of these functionalities into mixins, as you propose. There's some precedent for that (see https://github.com/kedro-org/kedro/pull/15), even though the DefaultArgumentsMixIn specifically seems to have disappeared in 0.16.0.
I haven't considered the adapter proposal from @astrojuanlu at all; I need to brush up on design patterns again to give good insight into mixin vs. adapter vs. something else. 😅