universal_pathlib
Add support for conversion from str in pydantic-settings
from pathlib import Path
from upath import UPath
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
    my_path: Path
    my_upath: UPath
settings = Settings(my_path="/tmp", my_upath="/tmp")
❯ python example.py
Traceback (most recent call last):
File "/home/bart/src/proj/example.py", line 9, in <module>
settings = Settings(my_path="/tmp", my_upath="/tmp")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bart/src/proj/.venv/lib/python3.12/site-packages/pydantic_settings/main.py", line 152, in __init__
super().__init__(
File "/home/bart/src/proj/.venv/lib/python3.12/site-packages/pydantic/main.py", line 212, in __init__
validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
my_upath
Input should be an instance of UPath [type=is_instance_of, input_value='/tmp', input_type=str]
For further information visit https://errors.pydantic.dev/2.9/v/is_instance_of
When reading through some older (solved) issues, I got the impression that this used to work. Should this work out of the box?
pydantic 2.9.2 Data validation using Python type ...
pydantic-core 2.23.4 Core functionality for Pydantic va...
pydantic-settings 2.5.2 Settings management using Pydantic
universal-pathlib 0.2.5 pathlib api extended to use fsspec...
Hi @BartSchuurmans
Regarding the implementation, imo behavior here is correct. This only works for your pathlib.Path attributes because of lax mode: https://docs.pydantic.dev/latest/concepts/conversion_table/#__tabbed_1_1
See:
>>> from pathlib import Path
... from pydantic import ConfigDict
... from pydantic_settings import BaseSettings
...
>>> class Settings(BaseSettings):
...     model_config = ConfigDict(strict=True)
...     my_path: Path
...
>>> Settings(my_path="/tmp")
Traceback (most recent call last):
File "<python-input-4>", line 1, in <module>
Settings(my_path="/tmp")
~~~~~~~~^^^^^^^^^^^^^^^^
File "/Users/andreaspoehlmann/Development/universal_pathlib/venv313/lib/python3.13/site-packages/pydantic_settings/main.py", line 144, in __init__
super().__init__(
~~~~~~~~~~~~~~~~^
**__pydantic_self__._settings_build_values(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<20 lines>...
)
^
)
^
File "/Users/andreaspoehlmann/Development/universal_pathlib/venv313/lib/python3.13/site-packages/pydantic/main.py", line 211, in __init__
validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
my_path
Input should be an instance of Path [type=is_instance_of, input_value='/tmp', input_type=str]
For further information visit https://errors.pydantic.dev/2.9/v/is_instance_of
I can see how users would expect to be able to just cast from string by default though. You can do this right now via a validator: https://docs.pydantic.dev/latest/concepts/validators/#before-after-wrap-and-plain-validators. But I agree that this should be more convenient. We should make sure to expose storage_options in some way to pydantic too.
@ap-- Thanks for your quick response!
I am using the PlainValidator now to support converting str to UPath:
from typing import Annotated
from pydantic import PlainValidator
from upath import UPath
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
    my_upath: Annotated[UPath, PlainValidator(lambda x: UPath(x))]
settings = Settings(my_upath="/tmp")
Here is a slightly more complete variant that also allows you to serialize to a string again:
from typing import Annotated, Any
import pydantic
from pydantic_core import core_schema
from upath import UPath
class UPathAnnotation:
    @classmethod
    def __get_pydantic_core_schema__(
        cls, _source_type: Any, _handler: pydantic.GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        from_str_schema = core_schema.chain_schema(
            [
                core_schema.str_schema(),
                core_schema.no_info_plain_validator_function(UPath),
            ]
        )
        return core_schema.json_or_python_schema(
            json_schema=from_str_schema,
            python_schema=core_schema.union_schema(
                [
                    core_schema.is_instance_schema(UPath),
                    from_str_schema,
                ]
            ),
            serialization=core_schema.to_string_ser_schema(),
        )

class MyModel(pydantic.BaseModel):
    path: Annotated[UPath, UPathAnnotation]
my_model = MyModel(path=UPath("s3://my-bucket"))
print(my_model)
my_model_json = my_model.model_dump_json()
print(my_model_json)
my_model_roundtrip = MyModel.model_validate_json(my_model_json)
print(my_model_roundtrip)
path=S3Path('s3://my-bucket/')
{"path":"s3://my-bucket/"}
path=S3Path('s3://my-bucket/')
Would you be open to integrate this into the UPath class optionally when pydantic is available in the environment? It would look something like this:
try:
    import pydantic
    from pydantic_core import core_schema
    PYDANTIC_2_AVAILABLE = tuple(map(int, pydantic.__version__.split(".")[:3])) >= (2,)
except (ImportError, AttributeError):
    PYDANTIC_2_AVAILABLE = False

class UPath:
    if PYDANTIC_2_AVAILABLE:
        @classmethod
        def __get_pydantic_core_schema__(
            cls, _source_type: Any, _handler: pydantic.GetCoreSchemaHandler
        ) -> core_schema.CoreSchema: ...
With this in place, anyone that already is using pydantic can use UPath natively, and we don't need to introduce any dependency on pydantic.
Happy to send a PR if this proposal is accepted.
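A side note on the version check in the sketch above: as far as I can tell, converting each version component with int() would raise a ValueError on a pre-release string such as "2.0b1", and that exception is not in the except clause. Below is a minimal stdlib-only sketch of a more tolerant check. The helper names major_version and is_available are made up for illustration and are not part of universal_pathlib or pydantic.

```python
import importlib.metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string like '2.9.2' or '2.0b1'."""
    head = version.split(".", 1)[0]
    digits = ""
    for ch in head:
        if not ch.isdigit():
            break
        digits += ch
    if not digits:
        raise ValueError(f"cannot parse major version from {version!r}")
    return int(digits)


def is_available(distribution: str, minimum_major: int) -> bool:
    """True if `distribution` is installed with at least `minimum_major`."""
    try:
        installed = importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        # The package is not installed at all.
        return False
    try:
        return major_version(installed) >= minimum_major
    except ValueError:
        # Unparseable version string: treat as unavailable.
        return False


# Hypothetical flag mirroring the one in the proposal above.
PYDANTIC_2_AVAILABLE = is_available("pydantic", 2)
```

This only inspects installed metadata and never imports pydantic, so it stays cheap when pydantic is absent. It ignores less common version forms such as epochs.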
@pmeier, I would support adding this if it makes working with UPath and pydantic easier. Does your above implementation handle storage options as well?
@ap--, any thoughts on the above?
@andrewfulton9
Does your above implementation handle storage options as well?
It does not, but it is a straightforward addition. Can I extract the storage options from UPath(...).storage_options and set them again through UPath(..., storage_options=storage_options)?
Is there anything else that is required to (de-)serialize a UPath other than its string representation and the storage options?
I'm not sure it will be that straightforward. I think one gotcha could be that the storage options aren't necessarily always going to be standard type objects. For instance, the gcsfs file system can take a google.auth.credentials.Credentials object for the token keyword argument, which may make (de-)serialization tricky in this case.
Makes sense. In that case my suggestion above is likely off the table. pydantic requires a known schema for serialization. I thought that was just a str and maybe a dict with native JSON types for the storage options. If that is not true, and it looks like it's not, there is little we can do here.
Hi @pmeier and @andrewfulton9
Would you be open to integrate this into the UPath class optionally when pydantic is available in the environment?
Very much so. Yes.
Regarding serialization: To be able to correctly represent a UPath, you basically have to consider UPath().protocol, UPath().path and UPath().storage_options. In many cases users interact with universal-pathlib (and with fsspec) in terms of "urlpaths", which can be thought of as the uri string version of the three combined. The caveat here is that for many filesystems not all information can be represented as a urlpath. So the conversion from
protocol: str
path: str
storage_options: Mapping[str, Any] # could be more specifically typed per filesystem
to urlpath: str is lossy.
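To make the lossiness concrete, here is a small stdlib-only sketch. PathParts and to_urlpath are hypothetical stand-ins for the three fields listed above, not universal_pathlib API, and the bucket name is invented.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class PathParts:
    """Hypothetical container for the three fields that describe a UPath."""

    protocol: str
    path: str
    storage_options: Mapping[str, Any] = field(default_factory=dict)


def to_urlpath(parts: PathParts) -> str:
    """Build the trivial urlpath; storage_options are silently dropped."""
    return f"{parts.protocol}://{parts.path}"


anonymous = PathParts("s3", "bucket/a/b/c", {"anon": True})
default = PathParts("s3", "bucket/a/b/c")

# The two descriptions differ, yet they collapse to the same urlpath.
print(anonymous != default)
print(to_urlpath(anonymous) == to_urlpath(default))
```

Both print statements report True: the inputs are distinct, but the string form cannot tell them apart.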
case 1
That being said, there are many cases in which a trivial representation of:
f"{protocol}://{path}" # i.e. memory:///a/b/c, file:///a/b/c, ...
is enough, because storage_options are most likely empty.
case 2
Then there are a few cases where the relevant storage_options can be easily serialized, because they have values of types that can be represented easily in json.
UPath("s3:///bucket/a/b/c", anon=True)
case 3
Some storage options are more complicated to serialize, for example the ones that @andrewfulton9 mentioned in the comment above. We could think of options where we fall back to pickling those, or drop them from the serialization.
case 4
Once chaining support is fully available, storage_options serialization is crucial, because the chained filesystems are nested in the storage_options.
protocol = "simplecache"
path = "/a/b/c"
storage_options = {
    "target_protocol": "memory",
    "target_options": {"someoption": "somevalue", ...},
}
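As a rough illustration of why the nesting matters, the sketch below walks such a chain and lists each filesystem with its own options. It assumes only the target_protocol / target_options keys shown in the example above; iter_chain is a hypothetical helper, not an fsspec or universal_pathlib function.

```python
from typing import Any, Iterator, Mapping

_CHAIN_KEYS = ("target_protocol", "target_options")


def iter_chain(
    protocol: str, storage_options: Mapping[str, Any]
) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (protocol, own_options) for every filesystem in a nested chain."""
    while True:
        # Options that belong to the current filesystem itself.
        own = {k: v for k, v in storage_options.items() if k not in _CHAIN_KEYS}
        yield protocol, own
        target = storage_options.get("target_protocol")
        if target is None:
            return
        # Descend one level into the wrapped filesystem.
        protocol = target
        storage_options = storage_options.get("target_options") or {}


chain = list(
    iter_chain(
        "simplecache",
        {"target_protocol": "memory", "target_options": {"someoption": "somevalue"}},
    )
)
print(chain)
```

Dropping storage_options during serialization would lose everything after the first entry of this chain, including which filesystem actually holds the data.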
support
I think we can start supporting case 1 and 2, and raise an exception in case 3. That way some of the users will be able to have fully functioning support for this, and we can think about how to enable case 3 further down the line.
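One way this policy could look, sketched with the standard library only: serialize the three fields when the storage options are JSON-native (cases 1 and 2) and raise otherwise (case 3). The function names dump_fields and load_fields and the JSON layout are assumptions for illustration, not necessarily what the eventual implementation does.

```python
import json
from typing import Any, Mapping


def dump_fields(protocol: str, path: str, storage_options: Mapping[str, Any]) -> str:
    """Serialize the three fields to JSON, or raise if an option is not JSON-native."""
    payload = {
        "protocol": protocol,
        "path": path,
        "storage_options": dict(storage_options),
    }
    try:
        return json.dumps(payload, sort_keys=True)
    except TypeError as exc:
        # Case 3: for example a credentials object passed as an option.
        raise ValueError(f"storage_options are not JSON-serializable: {exc}") from exc


def load_fields(data: str) -> tuple[str, str, dict[str, Any]]:
    """Inverse of dump_fields for the JSON-native cases."""
    payload = json.loads(data)
    return payload["protocol"], payload["path"], payload["storage_options"]


# Case 1: no storage options. Case 2: JSON-native storage options.
print(load_fields(dump_fields("memory", "/a/b/c", {})))
print(load_fields(dump_fields("s3", "bucket/a/b/c", {"anon": True})))
```

Raising explicitly in case 3 keeps the failure visible instead of silently dropping options, which leaves room to add pickling or another fallback later.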
@ap-- @andrewfulton9 I've opened a PR for case 1 and 2 in #395.