universal_pathlib

Add support for conversion from str in pydantic-settings

Open BartSchuurmans opened this issue 1 year ago • 2 comments

from pathlib import Path
from upath import UPath
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    my_path: Path
    my_upath: UPath

settings = Settings(my_path="/tmp", my_upath="/tmp")
❯ python example.py
Traceback (most recent call last):
  File "/home/bart/src/proj/example.py", line 9, in <module>
    settings = Settings(my_path="/tmp", my_upath="/tmp")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bart/src/proj/.venv/lib/python3.12/site-packages/pydantic_settings/main.py", line 152, in __init__
    super().__init__(
  File "/home/bart/src/proj/.venv/lib/python3.12/site-packages/pydantic/main.py", line 212, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
my_upath
  Input should be an instance of UPath [type=is_instance_of, input_value='/tmp', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/is_instance_of

When reading through some older (solved) issues, I got the impression that this used to work. Should this work out of the box?

pydantic                  2.9.2          Data validation using Python type ...
pydantic-core             2.23.4         Core functionality for Pydantic va...
pydantic-settings         2.5.2          Settings management using Pydantic
universal-pathlib         0.2.5          pathlib api extended to use fsspec...

BartSchuurmans, Oct 08 '24 09:10

Hi @BartSchuurmans

Regarding the implementation, imo the behavior here is correct. This only works for your pathlib.Path attribute because pydantic's default lax mode converts str to Path: https://docs.pydantic.dev/latest/concepts/conversion_table/#__tabbed_1_1

See:

>>> from pathlib import Path
... from pydantic import ConfigDict
... from pydantic_settings import BaseSettings
... 
>>> class Settings(BaseSettings):
...     model_config = ConfigDict(strict=True)
...     my_path: Path
...     
>>> Settings(my_path="/tmp")
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    Settings(my_path="/tmp")
    ~~~~~~~~^^^^^^^^^^^^^^^^
  File "/Users/andreaspoehlmann/Development/universal_pathlib/venv313/lib/python3.13/site-packages/pydantic_settings/main.py", line 144, in __init__
    super().__init__(
    ~~~~~~~~~~~~~~~~^
        **__pydantic_self__._settings_build_values(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<20 lines>...
        )
        ^
    )
    ^
  File "/Users/andreaspoehlmann/Development/universal_pathlib/venv313/lib/python3.13/site-packages/pydantic/main.py", line 211, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
my_path
  Input should be an instance of Path [type=is_instance_of, input_value='/tmp', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/is_instance_of

I can see how users would expect casting from a string to just work by default, though. You can do this right now via https://docs.pydantic.dev/latest/concepts/validators/#before-after-wrap-and-plain-validators (before, after, wrap, and plain validators). But I agree that this should be more convenient. We should also make sure to expose storage_options to pydantic in some way.

ap--, Oct 08 '24 10:10

@ap-- Thanks for your quick response!

I am using the PlainValidator now to support converting str to UPath:

from typing import Annotated
from pydantic import PlainValidator
from upath import UPath
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    my_upath: Annotated[UPath, PlainValidator(lambda x: UPath(x))]

settings = Settings(my_upath="/tmp")

BartSchuurmans, Oct 08 '24 10:10

Here is a slightly more complete variant that also allows you to serialize to a string again:

from typing import Annotated, Any

import pydantic
from pydantic_core import core_schema

from upath import UPath


class UPathAnnotation:
    @classmethod
    def __get_pydantic_core_schema__(
        cls, _source_type: Any, _handler: pydantic.GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        from_str_schema = core_schema.chain_schema(
            [
                core_schema.str_schema(),
                core_schema.no_info_plain_validator_function(UPath),
            ]
        )

        return core_schema.json_or_python_schema(
            json_schema=from_str_schema,
            python_schema=core_schema.union_schema(
                [
                    core_schema.is_instance_schema(UPath),
                    from_str_schema,
                ]
            ),
            serialization=core_schema.to_string_ser_schema(),
        )


class MyModel(pydantic.BaseModel):
    path: Annotated[UPath, UPathAnnotation]


my_model = MyModel(path=UPath("s3://my-bucket"))
print(my_model)

my_model_json = my_model.model_dump_json()
print(my_model_json)

my_model_roundtrip = MyModel.model_validate_json(my_model_json)
print(my_model_roundtrip)

Output:

path=S3Path('s3://my-bucket/')
{"path":"s3://my-bucket/"}
path=S3Path('s3://my-bucket/')

Would you be open to integrate this into the UPath class optionally when pydantic is available in the environment? It would look something like this:

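try:
    import pydantic
    from pydantic_core import core_schema

    # Compare only the major version: converting every component with int()
    # raises ValueError on pre-release versions such as "2.0a1".
    PYDANTIC_2_AVAILABLE = int(pydantic.__version__.split(".")[0]) >= 2
except (ImportError, AttributeError, ValueError):
    PYDANTIC_2_AVAILABLE = False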

class UPath:
    if PYDANTIC_2_AVAILABLE:
        @classmethod
        def __get_pydantic_core_schema__(
            cls, _source_type: Any, _handler: pydantic.GetCoreSchemaHandler
        ) -> core_schema.CoreSchema: ...

With this in place, anyone that already is using pydantic can use UPath natively, and we don't need to introduce any dependency on pydantic.

Happy to send a PR if this proposal is accepted.

pmeier, Aug 12 '25 08:08

@pmeier, I would support adding this if it makes working with UPath and pydantic easier. Does your above implementation handle storage options as well?

@ap--, any thoughts on the above?

andrewfulton9, Aug 12 '25 17:08

@andrewfulton9

Does your above implementation handle storage options as well?

It does not, but it is a straightforward addition. Can I extract the storage options from UPath(...).storage_options and set them again through UPath(..., storage_options=storage_options)?

Is there anything else that is required to (de-)serialize a UPath other than its string representation and the storage options?

pmeier, Aug 12 '25 20:08

I'm not sure it will be that straightforward. One gotcha is that storage options aren't always going to be standard type objects. For instance, the gcsfs file system can take a google.auth.credentials.Credentials object for the token keyword argument, which may make (de-)serialization tricky in this case.

andrewfulton9, Aug 12 '25 21:08

Makes sense. In that case my suggestion above is likely off the table. pydantic requires a known schema for serialization. I thought that was just a str and maybe a dict with native JSON types for the storage options. If that is not true, and it looks like it's not, there is little we can do here.

pmeier, Aug 12 '25 21:08

Hi @pmeier and @andrewfulton9

Would you be open to integrate this into the UPath class optionally when pydantic is available in the environment?

Very much so. Yes.

Regarding serialization: to correctly represent a UPath, you basically have to consider UPath().protocol, UPath().path and UPath().storage_options. In many cases users interact with universal-pathlib (and with fsspec) in terms of "urlpaths", which can be thought of as the URI string version of the three combined. The caveat here is that for many filesystems not all information can be represented as a urlpath. So the conversion from

protocol: str
path: str
storage_options: Mapping[str, Any]  # could be more specifically typed per filesystem

to urlpath: str is lossy.
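
To make the lossiness concrete, here is a minimal standard-library sketch. `PathSpec` and `to_urlpath` are hypothetical names used only for illustration and are not part of universal-pathlib or fsspec; the point is just that two different sets of storage options can collapse to the same string.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class PathSpec:
    """Hypothetical container for the three parts listed above."""

    protocol: str
    path: str
    storage_options: Mapping[str, Any] = field(default_factory=dict)

    def to_urlpath(self) -> str:
        # The storage options have no place in this string, so they are lost.
        return f"{self.protocol}://{self.path}"


anonymous = PathSpec("s3", "bucket/a/b/c", {"anon": True})
authenticated = PathSpec("s3", "bucket/a/b/c", {"anon": False})

# The two specs differ, yet both produce the same urlpath.
print(anonymous.to_urlpath())
print(anonymous.to_urlpath() == authenticated.to_urlpath())
print(anonymous == authenticated)
```

Going back from the string alone, there is no way to tell which of the two specs it came from.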

case 1

That being said, there are many cases in which a trivial representation of:

f"{protocol}://{path}"  # i.e. memory:///a/b/c, file:///a/b/c, ...

is enough, because storage_options are most likely empty.

case 2

Then there are a few cases where the relevant storage_options can be easily serialized, because they have values of types that can be represented easily in json.

UPath("s3:///bucket/a/b/c", anon=True)

case 3

Some storage options are more complicated to serialize; examples are the ones that @andrewfulton9 mentioned in the comment above. We could think of options where we fall back to pickling those, or drop them from the serialization.
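
The two fallbacks mentioned here (pickling or dropping) could look roughly like the sketch below. It uses only the standard library; `encode_storage_options`, the `fallback` parameter and the `"__pickle__"` envelope are all hypothetical names made up for this illustration, and a `bytes` value stands in for an object that JSON cannot encode, such as a credentials object.

```python
import base64
import json
import pickle
from typing import Any, Mapping


def is_json_safe(value: Any) -> bool:
    """Return True if json.dumps can encode the value as-is."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


def encode_storage_options(
    options: Mapping[str, Any], *, fallback: str = "drop"
) -> dict[str, Any]:
    """Encode options for JSON; treat non-JSON values according to `fallback`."""
    if fallback not in ("drop", "pickle"):
        raise ValueError(f"unknown fallback: {fallback!r}")
    encoded: dict[str, Any] = {}
    for key, value in options.items():
        if is_json_safe(value):
            encoded[key] = value
        elif fallback == "pickle":
            # Hypothetical envelope: base64 text wrapping the pickled value.
            blob = base64.b64encode(pickle.dumps(value)).decode("ascii")
            encoded[key] = {"__pickle__": blob}
        # fallback == "drop": leave the key out entirely.
    return encoded


# bytes stand in for a value JSON cannot encode (e.g. a credentials object).
options = {"anon": False, "token": b"not-json"}

dropped = encode_storage_options(options)
pickled = encode_storage_options(options, fallback="pickle")

print(dropped)
print(json.dumps(pickled))
restored_token = pickle.loads(base64.b64decode(pickled["token"]["__pickle__"]))
print(restored_token == options["token"])
```

Both fallbacks have a cost: dropping silently changes what the path points at, and unpickling data from an untrusted source can execute arbitrary code.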

case 4

Once chaining support is fully available, storage_options serialization becomes crucial, because the chained filesystems are nested in the storage_options.

protocol = "simplecache"
path = "/a/b/c"
storage_options = {
    "target_protocol": "memory",
    "target_options": {"someoption": "somevalue", ...},
}
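
As a small illustration of why the nesting matters, the sketch below walks such a structure and round-trips it through JSON. `iter_chain` is a hypothetical helper, not a universal-pathlib or fsspec function; it only assumes the `target_protocol` / `target_options` keys shown above.

```python
import json
from typing import Any, Iterator, Mapping


def iter_chain(protocol: str, storage_options: Mapping[str, Any]) -> Iterator[str]:
    """Yield every protocol in a chain, outermost first."""
    yield protocol
    target = storage_options.get("target_protocol")
    if target is not None:
        # The inner filesystem's options live one level down.
        yield from iter_chain(target, storage_options.get("target_options") or {})


storage_options = {
    "target_protocol": "memory",
    "target_options": {"someoption": "somevalue"},
}

print(list(iter_chain("simplecache", storage_options)))

# Dropping storage_options would lose the whole inner filesystem, so the
# nested mapping has to survive a serialization round trip intact.
restored = json.loads(json.dumps(storage_options))
print(restored == storage_options)
```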

support

I think we can start supporting case 1 and 2, and raise an exception in case 3. That way some of the users will be able to have fully functioning support for this, and we can think about how to enable case 3 further down the line.
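
A rough standard-library sketch of that policy follows. It is not the implementation from any PR: `dump_upath_parts` and the JSON layout are hypothetical, and `object()` merely stands in for a value that JSON cannot represent.

```python
import json
from typing import Any, Mapping


def dump_upath_parts(
    protocol: str, path: str, storage_options: Mapping[str, Any]
) -> str:
    """Serialize the three parts to JSON, refusing anything JSON cannot hold."""
    payload = {
        "protocol": protocol,
        "path": path,
        "storage_options": dict(storage_options),
    }
    try:
        return json.dumps(payload)
    except TypeError as exc:
        # Case 3: fail loudly instead of silently losing information.
        raise ValueError(
            "storage_options contain values that cannot be represented in JSON"
        ) from exc


# Case 1: no storage options at all.
case_1 = dump_upath_parts("memory", "/a/b/c", {})
# Case 2: storage options made of plain JSON types.
case_2 = dump_upath_parts("s3", "bucket/a/b/c", {"anon": True})
print(case_1)
print(case_2)

# Case 3: a non-JSON value is rejected.
try:
    dump_upath_parts("gcs", "bucket/a/b/c", {"token": object()})
except ValueError as exc:
    print(f"rejected: {exc}")
```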

ap--, Aug 12 '25 22:08

@ap-- @andrewfulton9 I've opened a PR for case 1 and 2 in #395.

pmeier, Aug 13 '25 13:08