Upgrading to `pydantic` v2
⚠️ Since this PR is turning out to be a mega PR, I will try to explain everything in a very detailed manner. Please go through it before diving into the changes and feel free to contact me if you have any further questions. (Keep in mind that we have a dedicated Notion page and a recorded meeting regarding this upgrade as well.)
Why?
Why are we upgrading `pydantic` to v2?
When it comes to the advantages of upgrading our pydantic dependency, there are mainly two big ones:
- A significant performance boost
- Being able to keep up with our integrations that have already upgraded to `pydantic` v2
At the same time, we face the following challenges:
- A very significant change to our codebase
- A slightly more challenging debugging process due to the backend changes in `pydantic`
Moreover, the pydantic team stopped the active development of V1. They still make releases with critical bug fixes and some security maintenance, but these will also stop at the end of June 2024.
Why are we upgrading `sqlmodel` and `sqlalchemy` along with `pydantic`?
- Our current version of `sqlmodel` (0.0.8) does not support `pydantic` v2.
- They started supporting `pydantic` v2 with version 0.0.14. Since then, they have supported both v1 and v2.
- However, at 0.0.12, they made a hard switch to `sqlalchemy` v2 as well.
- So, any version of `sqlmodel` that supports `pydantic` v2 requires you to work with `sqlalchemy` v2.
- Since we can not drop our `sqlmodel` dependency, we need to upgrade all of these packages altogether.
Other packages
Due to the `pydantic` v2 support, there are a few more important dependencies, like `fastapi`, that were affected by this upgrade. To see the full list, check the changes in the `pyproject.toml`.
How?
Migration Guides
Before I explain the changes in our codebase, I want to mention that both the `pydantic` and `sqlalchemy` upgrades come with significant changes. You can check the respective migration guides and get more info with these links: [pydantic v2 migration guide](https://docs.pydantic.dev/latest/migration/), [sqlalchemy v2 migration guide](https://docs.sqlalchemy.org/en/20/changelog/migration_20.html).
The pydantic team was also kind enough to offer a tool called [`bump-pydantic`](https://github.com/pydantic/bump-pydantic), which helped a lot at the start. It roughly modified 80 files or so, mostly focused on the configuration of models and some validators. But, as you can see from the number of changed files, there were a lot of things that we still had to adapt after the tool did its migration.
The most critical changes w.r.t. the pydantic upgrade
- Configuration for models has been reworked:

  > In Pydantic V2, to specify config on a model, you should set a class attribute called `model_config` to be a dict with the key/value pairs you want to be used as the config. The Pydantic V1 behavior to create a class called `Config` in the namespace of the parent `BaseModel` subclass is now deprecated.
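
  A minimal before/after sketch of this change:

  ```python
  from pydantic import BaseModel, ConfigDict


  # Pydantic V1 style (emits a deprecation warning under V2):
  class OldModel(BaseModel):
      class Config:
          frozen = True


  # Pydantic V2 style:
  class NewModel(BaseModel):
      model_config = ConfigDict(frozen=True)
  ```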
- Many configuration options have been either deprecated or removed. The most important ones include:
  - `allow_mutation` is now called `frozen` and it is set to `False` by default.
  - `underscore_attrs_are_private` is removed, and the models now behave as if this value were set to `True`.
  - The `smart_union` configuration parameter is now removed. Now, the default behavior is `left_to_right`. Check here for more details.
  - `json_encoders` were removed first, added back afterward, and deprecated later.
    - This was mainly used for `pydantic.SecretStr`s in our codebase. Now, we have replaced this functionality with a custom type annotation called `ZenSecretStr`, which serves simply as a `SecretStr` with a custom pydantic serializer (see the sketch below).
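
  For reference, a minimal sketch of how such a custom secret type can be built with `Annotated` and a `PlainSerializer` (illustrative only; `MySecretStr` is a hypothetical stand-in, not the actual `ZenSecretStr` implementation):

  ```python
  from typing import Annotated

  from pydantic import BaseModel, PlainSerializer, SecretStr

  # Hypothetical stand-in for ZenSecretStr: a SecretStr that serializes to
  # its plain value when dumping to JSON, instead of "**********".
  MySecretStr = Annotated[
      SecretStr,
      PlainSerializer(
          lambda v: v.get_secret_value(),
          return_type=str,
          when_used="json",
      ),
  ]


  class Credentials(BaseModel):
      password: MySecretStr


  print(Credentials(password="hunter2").model_dump_json())
  # {"password":"hunter2"}
  ```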
- The `regex` parameter is removed and a new parameter called `pattern` is now introduced:

  > Pydantic V1 used Python's regex library. Pydantic V2 uses the Rust [regex crate](https://github.com/rust-lang/regex). This crate is not just a "Rust version of regular expressions", it's a completely different approach to regular expressions. In particular, it promises linear time searching of strings in exchange for dropping a couple of features (namely look arounds and backreferences).
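
  A small sketch of the rename:

  ```python
  from pydantic import BaseModel, Field


  class Component(BaseModel):
      # Pydantic V1: name: str = Field(regex="^[a-z_]+$")
      # Pydantic V2 (look-arounds and backreferences are no longer supported):
      name: str = Field(pattern="^[a-z_]+$")
  ```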
- The `__fields__` attribute has been replaced by `model_fields`. Previously, in V1, you were able to get a `.type_` for each field, but this is not the case anymore. The replacement is called `.annotation`. However, it acts in a slightly different way. For instance, an `Optional[int]` field previously had a `.type_` of `int`, but now the `.annotation` is `Optional[int]`.
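
  For example:

  ```python
  from typing import Optional

  from pydantic import BaseModel


  class MyModel(BaseModel):
      count: Optional[int] = None


  # In V1, MyModel.__fields__["count"].type_ would have been int.
  print(MyModel.model_fields["count"].annotation)  # typing.Optional[int]
  ```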
- Validators have been heavily reworked. There are no `@validator`s or `@root_validator`s anymore. The new validators are called `@field_validator` and `@model_validator`. They now feature a lot more flexibility and functionality. IMO, this change is one of the most critical ones and it has a lot of implications when it comes to our codebase. So, if you would like to get a detailed explanation, you can check all the changes here.
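
  A minimal sketch of the new decorators:

  ```python
  from pydantic import BaseModel, field_validator, model_validator


  class User(BaseModel):
      name: str
      password: str
      password_confirm: str

      @field_validator("name")  # replaces @validator("name")
      @classmethod
      def name_not_empty(cls, value: str) -> str:
          if not value.strip():
              raise ValueError("name can not be empty")
          return value

      @model_validator(mode="after")  # replaces @root_validator
      def passwords_match(self) -> "User":
          if self.password != self.password_confirm:
              raise ValueError("passwords do not match")
          return self
  ```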
- The `skip_on_failure` parameter in the validator decorator has been removed. The only validator of this type that we had before now throws a warning instead of failing.
- There is an important change when it comes to the serialization of subclasses. Check this issue I have created on their GitHub page for more detail. TL;DR: if you are using subclasses in your models, do not forget to use `SerializeAsAny[NameOfBaseModel]` as the annotation to keep the same serialization behavior as v1.
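
  For example:

  ```python
  from pydantic import BaseModel, SerializeAsAny


  class Pet(BaseModel):
      name: str


  class Dog(Pet):
      breed: str = "unknown"


  class Owner(BaseModel):
      # Without SerializeAsAny, dumping an Owner holding a Dog would only
      # include the fields of Pet (the annotated type), unlike V1.
      pet: SerializeAsAny[Pet]


  print(Owner(pet=Dog(name="Rex")).model_dump())
  # {'pet': {'name': 'Rex', 'breed': 'unknown'}}
  ```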
- You can not use subclasses so easily anymore. For instance, if you subclass `int`, you can not directly use it as a type in a `pydantic` class. This is by design. You need to define a method called `__get_pydantic_core_schema__` in this new class in order to be able to use it as an annotation.
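
  A minimal sketch following the pattern from the pydantic docs (the `Port` class is illustrative):

  ```python
  from typing import Any

  from pydantic import BaseModel, GetCoreSchemaHandler
  from pydantic_core import core_schema


  class Port(int):
      """An int subclass; V2 requires an explicit core schema to use it."""

      @classmethod
      def __get_pydantic_core_schema__(
          cls, source_type: Any, handler: GetCoreSchemaHandler
      ) -> core_schema.CoreSchema:
          # Validate the input as an int, then wrap the result in Port.
          return core_schema.no_info_after_validator_function(
              cls, core_schema.int_schema()
          )


  class Server(BaseModel):
      port: Port


  assert isinstance(Server(port=8080).port, Port)
  ```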
- Let's say you define a new pydantic class `A` and you annotate one of its fields with another pydantic class `B`. Now, you subclass `A`, call it `A'`, and it requires a subclass version of `B`, let's call that `B'`. Previously, you could use an instance of `B` to create an instance of `A'`, but this is not the case anymore. You have to explicitly convert `B` to `B'` before you can pass it to the constructor of `A'`. Check the `base_zen_store` and `base_secret_store` implementations for more details.
- The `pydantic` definition of generic models has been removed as well:

  > The `pydantic.generics.GenericModel` class is no longer necessary, and has been removed. Instead, you can now create generic `BaseModel` subclasses by just adding `Generic` as a parent class on a `BaseModel` subclass directly. This looks like `class MyGenericModel(BaseModel, Generic[T]): ...`.
- Fields do not have a `required` attribute anymore; instead, they have an `is_required()` method. Due to this, if you would like to make a field non-required, you have to set a default value or a default factory.
- There is also a very significant change with regard to optional and nullable fields. Most importantly in our case, if you want to define an optional value, you have to provide at least `None` as a default value. Otherwise, in contrast to V1, even if you use `Optional[int]`, it will still be a required field.
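
  For example:

  ```python
  from typing import Optional

  from pydantic import BaseModel, ValidationError


  class MyModel(BaseModel):
      a: Optional[int]         # required in V2 (was implicitly optional in V1)
      b: Optional[int] = None  # truly optional, defaults to None


  try:
      MyModel()
  except ValidationError as e:
      print(e)  # field "a" is required

  MyModel(a=None)  # Works
  ```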
- `parse_obj` and `parse_raw` have been deprecated; instead, the recommendation is to use `model_validate`. However, this method functions in a slightly different way. In contrast to V1, if you feed it an instance of another model class (even a parent class, as below), it fails with a validation error:

  ```python
  from pydantic import BaseModel


  class A(BaseModel):
      a: int = 3


  class B(A):
      b: str = "str"


  b1 = B.model_validate(A(a=2))  # Fails with a validation error
  b2 = B.model_validate_json(A().model_dump_json())  # Works
  ```
- The `update_forward_refs` method has been reworked and renamed. Now it is enough to just do `MyModel.model_rebuild()`.
- There is a new Python package called `pydantic-settings`. Classes such as the `SettingsConfigDict` are now a part of this package.
- `ValidatedFunction` has been deprecated. Check the `utils/pydantic_utils.py` for further info and see if we can remove this. (tagging @schustmi here)
- `ModelMetaclass` has been moved to the `pydantic._internal` module. Check the `global_config.py` and `typed_model.py` in our codebase.
- They removed their collection of utility methods in their `typing` module (including functions such as `get_args` and `get_origin`). Since our codebase heavily used these functions, I carried the original versions over to work in our codebase.
- There were instances where we used `some_model_instance.json()`. This behavior is now replaced with `some_model_instance.model_dump_json()`. However, if you would like to parameterize this process by using keys like `sort_keys`, this is unfortunately not possible anymore. As an alternative, I have applied `some_model_instance.model_dump()` first and then used the `json` package manually to dump it with the `sort_keys` parameter.
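
  Roughly, the workaround looks like this:

  ```python
  import json

  from pydantic import BaseModel


  class MyModel(BaseModel):
      b: int = 1
      a: int = 2


  instance = MyModel()
  # V1: instance.json(sort_keys=True) is no longer possible in V2, so:
  sorted_json = json.dumps(instance.model_dump(), sort_keys=True)
  print(sorted_json)  # {"a": 2, "b": 1}
  ```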
- `pydantic.Field`s with the `max_length` setting now fail if they have `UUID` in their annotation. In such cases, I have separated the validation function.
- When it comes to fields, `field_info.extra` has been renamed to `field.json_schema_extra`. You can find an example of how this is being used by checking the changes in the `zenml.utils.secret_utils`.
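
  A small sketch of the rename (the `sensitive` key is just an illustrative example, not necessarily what `secret_utils` uses):

  ```python
  from pydantic import BaseModel, Field


  class MyModel(BaseModel):
      name: str = Field(json_schema_extra={"sensitive": True})


  # V1: MyModel.__fields__["name"].field_info.extra
  print(MyModel.model_fields["name"].json_schema_extra)  # {'sensitive': True}
  ```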
- This one was a bit interesting and hard to figure out. When you do `zenml up`, if there is a response model that has an `Enum` field defined with a `pydantic.Field` and the field is parameterized with `max_length`, the local server deployment will fail. I still can not figure out the root cause of this issue. However, this is not a critical use case, so I removed these instances and now we can do `zenml up` successfully.
- With `pydantic` V2, the issue regarding multiple inherited config classes is now resolved. The related ignore tags have been removed.
- The `schema_json` method is deprecated; we are using `model_json_schema` and `json.dumps` instead.
- The `copy` method is deprecated; we are using `model_copy` instead. You can check the docstring of `BaseModel.copy` for details about how to handle include and exclude.
- Our update model decorator has been removed. At first, this change was mainly triggered by various failing `mypy` linting issues, because they changed the way of defining required/optional values in `pydantic` v2. However, it soon helped us reveal some linting issues which were suppressed by the relationship between our previous `...Request` and `...Update` models. Each update model is now implemented properly with optional annotations.
- With `pydantic` v2, the error handling within the validators has been reworked as well:

  > As mentioned in the previous sections you can raise either a `ValueError` or `AssertionError` (including ones generated by `assert ...` statements) within a validator to indicate validation failed. You can also raise a `PydanticCustomError` which is a bit more verbose but gives you extra flexibility. Any other errors (including `TypeError`) are bubbled up and not wrapped in a `ValidationError`.
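
  A minimal sketch of the difference:

  ```python
  from pydantic import BaseModel, ValidationError, field_validator


  class MyModel(BaseModel):
      a: int

      @field_validator("a")
      @classmethod
      def check_a(cls, value: int) -> int:
          if value < 0:
              # ValueError / AssertionError -> wrapped in a ValidationError
              raise ValueError("must be non-negative")
          if value > 100:
              # Any other error (e.g. TypeError) bubbles up unwrapped!
              raise TypeError("too large")
          return value


  try:
      MyModel(a=-1)
  except ValidationError:
      print("wrapped in ValidationError")

  try:
      MyModel(a=101)
  except TypeError:
      print("bubbled up as a plain TypeError")
  ```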
- The following code block used to execute successfully in `pydantic` v1, but this behavior has changed in `pydantic` v2 and it now throws a `ValidationError`:

  ```python
  from pydantic import BaseModel


  class MyModel(BaseModel):
      a: str


  MyModel(a=2.3)
  ```
- The following code block used to print out `True` and `True`, but with the new changes in `pydantic` it now outputs `False` and `False`:

  ```python
  from pydantic import BaseModel


  class One(BaseModel):
      a: int = 3


  class Two(One):
      pass


  print(One() == Two())
  print(One() == {"a": 3})
  ```
- Fields that are provided as extra fields to any model can be accessed by `.model_extra` now.
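
  For example:

  ```python
  from pydantic import BaseModel, ConfigDict


  class MyModel(BaseModel):
      model_config = ConfigDict(extra="allow")

      a: int = 1


  instance = MyModel(a=2, extra_field="hello")
  print(instance.model_extra)  # {'extra_field': 'hello'}
  ```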
- In contrast to `pydantic` v1, defining any `pydantic` class without properly annotating its fields will now raise a `pydantic.errors.PydanticUserError`.
Critical changes w.r.t. the sqlmodel upgrade
The most critical factor in this upgrade stems from the PR right here, which changed the way `Enum` values are handled.
For instance, if you are familiar with our component schema (which we defined through `sqlmodel`), we have a field called `type`, which was a `StackComponentType`:
```python
# the schema of the stack component
class StackComponentSchema(...):
    ...
    type: StackComponentType
    ...


# and the stack component type looked like this:
class StackComponentType(StrEnum):
    ...
    ARTIFACT_STORE = "artifact_store"
    ...
```
With this setup, when we registered, for instance, a new artifact store, we created an entry in the `components` table of our DB where the column `type` had the string `artifact_store` stored in it as a value. However, with the new changes, `sqlmodel` now gives higher priority to `Enum` fields and saves the value `ARTIFACT_STORE` instead. While this is alright if you are starting from scratch, if you have any entry in a table with an `Enum` field, `zenml` will fail after the upgrade. Instead of taking the migration route, we decided to adjust our schemas to use `str` fields instead and updated the corresponding `to_model`, `update`, and `from_request` methods, roughly as sketched below.
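A simplified sketch of the adjustment (the trimmed-down schema below is illustrative, not the actual implementation):

```python
from enum import Enum
from typing import Optional

from sqlmodel import Field, SQLModel


class StackComponentType(str, Enum):
    ARTIFACT_STORE = "artifact_store"


# The column is now a plain string, so the DB keeps storing "artifact_store"
# rather than the enum member name "ARTIFACT_STORE".
class StackComponentSchema(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    type: str  # was: StackComponentType


# Conversion back to the Enum happens at the model boundary (e.g. in to_model):
schema = StackComponentSchema(id=1, type=StackComponentType.ARTIFACT_STORE.value)
assert StackComponentType(schema.type) is StackComponentType.ARTIFACT_STORE
```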
Critical changes w.r.t. the sqlalchemy upgrade
The new sqlalchemy v2 has a lot of functional and syntactic changes as well. Luckily, most of the pure sqlalchemy code in our codebase can only be found around our sql_zen_store implementation and migration scripts. I have tried my best to fix all the deprecation issues but I ask you to pay extra attention to these changes, especially around the migration scripts.
Integration Corner
I will try to update the following subsections as the fixes come along. Here you will find a list of all the integrations that have been affected by the aforementioned changes to our codebase and dependencies.
AWS
The upgrade to kfp V2 (in integrations like kubeflow, tekton, or gcp) bumps our protobuf dependency from 3.X to 4.X. This is why we need to relax the sagemaker dependency.
- [ ] This needs to be tested very thoroughly as it is one of our major integrations.
Airflow
I believe this was the most critical update. Airflow still has a dependency on sqlalchemy V1, and this conflicts with this entire PR, as we have to migrate to sqlalchemy V2. However, we managed to figure out a way where we can still run pipelines on Airflow by keeping the Airflow and ZenML installations separated.
- [ ] Test the execution locally.
- [ ] Test the execution on a remote setup.
Evidently
Relaxing the main dependency here resolved the installation issue. They started supporting pydantic V2 from version 0.4.16. As their latest version is 0.4.22, the dependency is limited between the two. When you install zenml and the evidently integration afterward, it installs 0.4.22. However, if you use the `install-zenml-dev` script, it ends up installing 0.4.16. This is why it might make sense to test both versions.
- [ ] Test a pipeline using the Evidently integration on `0.4.16`
- [ ] Test a pipeline using the Evidently integration on `0.4.22`
Feast
To fix the installation issues, we had to remove the redis extra from the feast integration. As the latest version is 0.37.1, the dependency is capped at 0.37.1.
- [ ] Test a pipeline using the Feast integration.
GCP
This is also one of the major changes. With the switch to its own V2, the kfp Python SDK removed its pydantic v1 dependency, which ultimately solved our installation issues. However, this means that we have to adapt our integration accordingly to work with `kfp>=2.0`. You can find the migration guide for KFP SDK V1 to V2 here. Also, Felix previously worked on this issue and you can find his changes right here in this PR.
- [ ] This needs to be tested very thoroughly as it is one of our major integrations.
Great Expectations
Similar to the previous integrations, relaxing the main dependencies here resolved the installation issue. As they started supporting installations with pydantic v2 from 0.17.15, the minimum requirement was changed. There was a note in the requirements of this integration stating that `typing_extensions>4.6.0` does not work with GE, and the resolved version is 4.10.0. We need to figure out if this is still an issue. Moreover, they are closing in on their 1.0 release. Since this might include major breaking changes, I put the upper limit at `<1.0` for now.
- [ ] Test a pipeline using the Great Expectations integration.
Kubeflow
Similar to the GCP integration, relaxing the kfp python SDK dependency resolved the installation issue, however, the code still needs to be migrated. You can find the migration guide for KFP SDK V1 to V2 here. Also, Felix previously worked on this issue and you can find his changes right here in this PR.
- [ ] Test after the migration.
mlflow
This was an interesting change. As they stand right now, the dependencies of the mlflow integration are compatible with zenml using pydantic v2. However, if you install zenml first and then do zenml integration install mlflow -y, it downgrades pydantic to v1. (I think this is an important problem that we have to solve separately in a generalized manner!) This is why I had to manually add the same duplicated pydantic requirement in the integration definition as well.
- [ ] Test it just in case.
Label Studio
They still have a hard dependency on `pydantic = "<=1.11.0,>=1.7.3"`. @strickvl has opened an issue on their GitHub page. We are currently waiting for further development from their side.
Skypilot
While `uv` was able to compile a list of requirements using `pydantic>=2.7.0` with both `skypilot[aws]<=0.5.0` and `skypilot[gcp]<=0.5.0` respectively, `skypilot[azure]<=0.5.0` is still creating issues.
- [ ] Test all possible variations of this.
Tensorflow
The new version of pydantic creates a drift between the tensorflow and typing_extensions packages. Relaxing the dependencies here resolves the issue, however, there is a known issue between torch and tensorflow and we need to test whether this is still problematic.
Additionally, the upgrade to kfp V2 (in integrations like kubeflow, tekton, or gcp) bumps our protobuf dependency from 3.X to 4.X. This is another reason why the tensorflow upgrade is necessary.
- [ ] Test the `tensorflow` integration.
Tekton
The tekton integration should go through a major change as well, since it is affected by the kfp changes. You can find the migration guide for KFP SDK V1 to V2 here. Also, Felix previously worked on this issue and you can find his changes right here in this PR.
- [ ] Needs to be tested.
Docs changes
Keep in mind, much like the changes in the airflow integration, some future updates will probably require changes in our documentation.
Special Thanks
Special thanks to the pydantic team (especially @sydney-runkle) for helping us out when we got stuck. It has been a blast to work on this upgrade. Looking forward to V3!
Leftover TODOs
- [ ] Test out the integrations, fix the implementation and update the docs.
- [ ] The code is still using `pydantic_encoder`, which is deprecated. Find an alternative solution to it.
- [ ] The `any_pydantic_model.dict()` method is now deprecated. Even though I fixed and removed most of these calls, it is really hard to scan the codebase for similar instances. So, anytime you run into any deprecation warnings, we have to remove these calls.
- [ ] There are a few deprecation warnings regarding `sqlalchemy` as well. We need to find replacements for those.
- [ ] We are using the `ValidatedFunction` concept from `pydantic` when we compile our pipelines, and it is now deprecated. Moreover, if the compilation fails, the error message is not very clear. To reproduce, remove the `TypeError` from this line, and execute both `test_call_step_with_too_many_args_and_kwargs` and `test_call_step_with_too_many_args`. (They both lead to slightly different error messages.)
- [ ] Refactor the `__init__` call from the `BaseService`.
Pre-requisites
Please ensure you have done the following:
- [X] I have read the CONTRIBUTING.md document.
- [ ] If my change requires a change to docs, I have updated the documentation accordingly.
- [ ] I have added tests to cover my changes.
- [X] I have based my new branch on `develop` and the open PR is targeting `develop`. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
- [ ] If my changes require changes to the dashboard, these changes are communicated/requested.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [X] Breaking change (fix or feature that would cause existing functionality to change) (UNDERSTATEMENT!)
- [ ] Other (add details above)
Quickstart template updates in examples/quickstart have been pushed.
LLM Finetuning template updates in examples/llm_finetuning have been pushed.
@bcdurak @avishniakov @strickvl,
Ping me if there's anything you folks need help with / lingering questions. Exciting stuff here!
NLP template updates in examples/e2e_nlp have been pushed.
All green, wow!