prefect
prefect copied to clipboard
RemoteFileSystem doesn't allow all filesystems provided by fsspec
First check
- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar issue and didn't find it.
- [X] I searched the Prefect documentation for this issue.
- [X] I checked that this issue is related to Prefect and not one of its dependencies.
Bug summary
RemoteFileSystem is said to support all filesystems implemented by fsspec. However, it fails with arrow_hdfs
, that is a replacement for hdfs
.
The reason is the prefect uses the following code to parse the url
@validator("basepath")
def check_basepath(cls, value):
scheme, netloc, _, _, _ = urllib.parse.urlsplit(value)
While fsspec uses custom parsing. The complete list of filesystems is given here.
Reproduction
from prefect.filesystems import RemoteFileSystem
storage_block = RemoteFileSystem(
basepath="arrow_hdfs://server/path",
settings={
"port": 8020,
},
)
Error
Script at 'blocks/storage.py' encountered an exception
Traceback (most recent call last):
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "blocks/storage.py", line 3, in <module>
storage_block = RemoteFileSystem(
File "/Users/me/work/gitlab/me/india-data-pipeline/.venv/lib/python3.9/site-packages/prefect/blocks/core.py", line 172, in __init__
super().__init__(*args, **kwargs)
File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for RemoteFileSystem
basepath
Base path must start with a scheme. Got 'arrow_hdfs://server/path'. (type=value_error)
Versions
Version: 2.4.0
API version: 0.8.0
Python version: 3.9.13
Git commit: 513639e8
Built: Tue, Sep 13, 2022 2:15 PM
OS/Arch: darwin/x86_64
Profile: default
Server type: hosted
Additional context
No response
Thanks for the issue and research @knl! That definitely does look like the source of the bug.
@bunchesofdonald I think I can fix this. Will I have to replace the native module urllib.parse
with rfc3986 used by HTTPX.
@hallenmaia if you want to work on fixing this that would be fantastic. I did check out rfc3986
it doesn't appear to parse it 'correctly' either:
>>> from rfc3986 import urlparse
>>> urlparse('arrow_hdfs://server/path')
ParseResult(scheme=None, userinfo=None, host=None, port=None, path='arrow_hdfs://server/path', query=None, fragment=None)
Here are some approaches that might work:
- The furl library.
- Trying to manually parse the scheme and passing the
//server/path
part to urlparse. - Replacing the
_
in the scheme with-
seems to have urllib recognize it, could be a path there.
I would probably start with furl
and see if it does what we're wanting. Let us know if you need some more guidance, and thanks again for looking into this!
A valid scheme must contain: [a-zA-Z][a-zA-Z0-9+.-]. Eg:
- https://github.com/python/cpython/blob/96739bccf220689a54ef33341f431eda19c287fa/Lib/urllib/parse.py#L76
- https://github.com/python-hyper/rfc3986/blob/164073434fe530e14ceca4ff6a0683e23e2fde48/src/rfc3986/abnf_regexp.py#L41
- https://www.rfc-editor.org/rfc/rfc3986#section-3.1
In fspec, some implementations have the character "_" in the protocol (schema). But even he fails in this part:
from fsspec.utils import infer_storage_options
infer_storage_options("arrow_hdfs://server/path")
{'protocol': 'file', 'path': 'arrow_hdfs://server/path'}
If I remove the "_" from the schema it validates correctly:
from fsspec.utils import infer_storage_options
infer_storage_options("arrowhdfs://server/path")
{'protocol': 'arrowhdfs', 'path': '/path', 'host': 'server'}
I'm looking for a way to properly validate this.
@bunchesofdonald furl fails too: https://github.com/gruns/furl/blob/774846234ff803606fdd289a7549f9b50b2b3677/furl/furl.py#L227
Right, we'll probably have to validate the schema against fsspec's known_implementations
, which is probably more accurate anyway, since we don't want to accept valid schemes that aren't part of fsspec, like dns
or irc
. I think something like this would work to do that:
from fsspec.registry import known_implementations
from furl import furl
parsed = furl("arrow_hdfs://server/path")
is_valid_scheme = parsed.scheme in set(known_implementations)
I opened a issue in the fspec
repository to check this.
https://github.com/fsspec/filesystem_spec/issues/1050
The protocol arrow_hdfs
is a replacement for hdfs
and there is a possibility that it becomes just arrow
or hdfs
.
@bunchesofdonald The change will be in fspec
. No changes to the Prefect
will be necessary.
That's great! Thanks for looking into that and getting it dealt with @hallenmaia