prefect icon indicating copy to clipboard operation
prefect copied to clipboard

RemoteFileSystem doesn't allow all filesystems provided by fsspec

Open knl opened this issue 2 years ago • 9 comments

First check

  • [X] I added a descriptive title to this issue.
  • [X] I used the GitHub search to find a similar issue and didn't find it.
  • [X] I searched the Prefect documentation for this issue.
  • [X] I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

RemoteFileSystem is said to support all filesystems implemented by fsspec. However, it fails with arrow_hdfs, that is a replacement for hdfs.

The reason is the prefect uses the following code to parse the url

    @validator("basepath")
    def check_basepath(cls, value):
        scheme, netloc, _, _, _ = urllib.parse.urlsplit(value)

While fsspec uses custom parsing. The complete list of filesystems is given here.

Reproduction

from prefect.filesystems import RemoteFileSystem

storage_block = RemoteFileSystem(
    basepath="arrow_hdfs://server/path",
    settings={
        "port": 8020,
    },
)

Error

Script at 'blocks/storage.py' encountered an exception
Traceback (most recent call last):
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "blocks/storage.py", line 3, in <module>
    storage_block = RemoteFileSystem(
  File "/Users/me/work/gitlab/me/india-data-pipeline/.venv/lib/python3.9/site-packages/prefect/blocks/core.py", line 172, in __init__
    super().__init__(*args, **kwargs)
  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for RemoteFileSystem
basepath
  Base path must start with a scheme. Got 'arrow_hdfs://server/path'. (type=value_error)

Versions

Version:             2.4.0
API version:         0.8.0
Python version:      3.9.13
Git commit:          513639e8
Built:               Tue, Sep 13, 2022 2:15 PM
OS/Arch:             darwin/x86_64
Profile:             default
Server type:         hosted

Additional context

No response

knl avatar Sep 19 '22 12:09 knl

Thanks for the issue and research @knl! That definitely does look like the source of the bug.

bunchesofdonald avatar Sep 19 '22 14:09 bunchesofdonald

@bunchesofdonald I think I can fix this. Will I have to replace the native module urllib.parse with rfc3986 used by HTTPX.

hallenmaia avatar Sep 20 '22 12:09 hallenmaia

@hallenmaia if you want to work on fixing this that would be fantastic. I did check out rfc3986 it doesn't appear to parse it 'correctly' either:

>>> from rfc3986 import urlparse
>>> urlparse('arrow_hdfs://server/path') 
ParseResult(scheme=None, userinfo=None, host=None, port=None, path='arrow_hdfs://server/path', query=None, fragment=None)

Here are some approaches that might work:

  • The furl library.
  • Trying to manually parse the scheme and passing the //server/path part to urlparse.
  • Replacing the _ in the scheme with - seems to have urllib recognize it, could be a path there.

I would probably start with furl and see if it does what we're wanting. Let us know if you need some more guidance, and thanks again for looking into this!

bunchesofdonald avatar Sep 20 '22 14:09 bunchesofdonald

A valid scheme must contain: [a-zA-Z][a-zA-Z0-9+.-]. Eg:

  • https://github.com/python/cpython/blob/96739bccf220689a54ef33341f431eda19c287fa/Lib/urllib/parse.py#L76
  • https://github.com/python-hyper/rfc3986/blob/164073434fe530e14ceca4ff6a0683e23e2fde48/src/rfc3986/abnf_regexp.py#L41
  • https://www.rfc-editor.org/rfc/rfc3986#section-3.1

In fspec, some implementations have the character "_" in the protocol (schema). But even he fails in this part:

from fsspec.utils import infer_storage_options
infer_storage_options("arrow_hdfs://server/path")
{'protocol': 'file', 'path': 'arrow_hdfs://server/path'}

If I remove the "_" from the schema it validates correctly:

from fsspec.utils import infer_storage_options
infer_storage_options("arrowhdfs://server/path")
{'protocol': 'arrowhdfs', 'path': '/path', 'host': 'server'}

I'm looking for a way to properly validate this.

hallenmaia avatar Sep 20 '22 14:09 hallenmaia

@bunchesofdonald furl fails too: https://github.com/gruns/furl/blob/774846234ff803606fdd289a7549f9b50b2b3677/furl/furl.py#L227

hallenmaia avatar Sep 20 '22 15:09 hallenmaia

Right, we'll probably have to validate the schema against fsspec's known_implementations, which is probably more accurate anyway, since we don't want to accept valid schemes that aren't part of fsspec, like dns or irc . I think something like this would work to do that:

from fsspec.registry import known_implementations
from furl import furl
parsed = furl("arrow_hdfs://server/path")
is_valid_scheme = parsed.scheme in set(known_implementations)

bunchesofdonald avatar Sep 20 '22 15:09 bunchesofdonald

I opened a issue in the fspec repository to check this.

https://github.com/fsspec/filesystem_spec/issues/1050

The protocol arrow_hdfs is a replacement for hdfs and there is a possibility that it becomes just arrow or hdfs.

hallenmaia avatar Sep 20 '22 17:09 hallenmaia

@bunchesofdonald The change will be in fspec. No changes to the Prefect will be necessary.

hallenmaia avatar Sep 20 '22 18:09 hallenmaia

That's great! Thanks for looking into that and getting it dealt with @hallenmaia

bunchesofdonald avatar Sep 20 '22 18:09 bunchesofdonald