[WIP] Replace Python gRPC with Rust
Tracking issue
https://github.com/flyteorg/flyte/issues/5344
Users continuously face weird grpcio bugs, such as version errors, or even being unable to install grpcio.
Why are the changes needed?
To remove the dependency on gRPC in Python and solely replace it with Rust.
What changes were proposed in this pull request?
For details, please check the design doc.
- [ ] Add no auth Rust client in
remote.py.- [ ] Task endpoints https://github.com/flyteorg/flytekit/pull/2377
- [ ] Workflow endpoints
- [ ] Launchplan endpoints
- [ ] Other endpoints
- [ ] Add authentication for Rust client.
- [ ] PKCE Authenticator
- [ ] Add Rust CI jobs
- [ ] Build with gRPC
- [ ] Build without gRPC
- [ ] Integration Test
- [ ] Binding config
- [ ]
abi3-py38
- [ ]
Architecture
How was this patch tested?
Right now, only get_task() has been tested for the POC (proof of concept) purpose.
There is still a lot of work to be done afterwards.
Setup process
- Use crate Tonic to compile proto files and generates client service code.
- Use crate PyO3 to build Rust bindings for Python so we can offloading gRPC to compiled extension.
Screenshots
The performance of FlyteKit slightly benefits from Rust, which remains super amazing.
Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [ ] All new and existing tests passed.
- [X] All commits are signed-off.
Related PRs
https://github.com/flyteorg/flytekit/pull/2307
Docs link
I'm a rookie both in Rust and gRPC, so please do not hesitate to give me any kind of advice.
Thank you for looking into this!
Here are my high level thoughts for implementing a FlyteRemote in Rust:
- Do not use
flytekit.models.flytekit.modelswas required whenflytewas originally designed, but it is no longer required. The Rust FlyteRemote should return the IDL objects.- There are many strategies to deprecating the current
flytekit.remote.FlyteRemote. For example, the rust is inflytekit.remote_v2.FlyteRemoteand returns IDL objects.
- There are many strategies to deprecating the current
- How to access the IDL objects from Python
- My preferred goal is to also remove the dependency on
grpcrelated packages such asgoogleapis-common-protosandprotobuf. (Many issues come fromprotobuf) - This PR's solution is to encode the IDL in rust and decode in Python. I do not like the solution because it requires
flyteidl(which requiresprotobuf) to decode the IDL.- On the other hand, I see the external traits + external struct issue. The only workaround I see is to have our own rust macro that creates a custom struct, that wraps the external struct and adds in the pyo3 traits we need. (I am not sure how well this works)
- My preferred goal is to also remove the dependency on
- We use the the GRPC Python API to create intercepters. It'll likely have to migrate these intercepters to rust as well.
- There is a plan to move the
flytekitrepo intoflyte, so it'll be easier to manage all the protobufs.
Thank you so much, Thomas @thomasjpfan . I will discuss with Kevin @pingsutw in two hours.
access the IDL objects from Python
That makes sense to me. Sounds like my original approach. Back then, I didn't know that deprecating the current flytekit.remote.FlyteRemote is in the roadmap. So, if removing Python dependencies not only on gRPC but also on protobuf is the goal, we should eventually achieve that. It's not as aggressive as it may sound because, as you mentioned, we can task a flytekit.remote_v2.FlyteRemote strategy.
external traits + external struct issue
For sure, this is the tricky one. I will definitely look into a customized macro approach. Perhaps It can overcome the Rust coherence rule, or it can't.
Also, it's worth mentioning that PyO3 uses get_all and set_all attribute macros for Python ends to access its every member variable nestedly all the way down to primitive types for all high-level abstracted structs. And we need a new() method for each class so that we can construct from Python, also there's the __repr__ stuff to consider.
manage all the protobufs
So, I should modify this buf/generate stuff to better generate Rust service code for future usage. The current version has no client code like AdminServiceClient. It's a more unified way to manage proto files in one single place. Alternatively, should I compile FlyteIDL proto files to start over in generating more complete Rust service code within the FlyteRS repository?
this is awesome but @vlad-ivanov-name seems to have a good grasp. Can you please help review and help us drive a best choices in designing.
@austin362667 also I have a PR https://github.com/flyteorg/flytekit/pull/2307
The current version has no client code like AdminServiceClient. It's a more unified way to manage proto files in one single place. Alternatively, should I compile FlyteIDL proto files to start over in generating more complete Rust service code within the FlyteRS repository?
If there is a way to get buf to only generate service code, then I like that solution. Otherwise, I am okay with the current solution of using https://github.com/tokio-rs/prost.
For sure, this is the tricky one. I will definitely look into a customized macro approach. Perhaps It can overcome the Rust coherence rule, or it can't.
To answer this question, I would use this example protobuf and try different techniques for exposing it into Python. I'll start with making sure the Person works by itself and then work in the other objects:
message Person {
string name = 1;
int32 id = 2; // Unique ID number for this person.
string email = 3;
}
I think this Rust<->Python data exchange is the unknown technical part of this work. I know removing our dependency on the Python protobuf library makes this task harder, but protobuf is what causes the most issues. For me, removing the dependency has more impact on the user experience than the speed gains.
I think figuring out the Rust<->Python data exchange and writing a Rust's FlyteClient can be two separate task:
- The Rust
FlyteClientis a pure rust library that implements and API required forFlyteRemote. This pure rust library does not depend onpyo3and does not worry about the Python binding. I find that having a pure rust library that does not depend onpyo3makes it easier to test in rust. - A second rust library that depends on
pyo3and binds the RustFlyteClientwith Python. All this library cares about is binding.
I feel like 2. is the unknown part right now, so I am pushing to figuring that out first.
I think something's up with github as it first duplicated my review, and after I hid one duplicate the second disappeared 🙈 sorry about that, you can expand the comment above
get buf to only generate service code
I think @eapolinario resolved this issue with PR https://github.com/flyteorg/flyte/pull/5187.
removing the dependency has more impact on the user experience than the speed gains
I totally agree with this objective.
makes this task harder, but
protobufis what causes the most issues
Kafka PMC @chia7712 is interested in the purpose of these refactorings. In his experience, projects with multiple languages often struggle, making consistency maintenance difficult. For example, Integrating Java with C/C++ can cause memory leaks, and Java with Scala may lead to binary compatibility issues. This complexity extends to build logic, requiring various tools and additional manpower for maintenance.
Rust<->Python data exchange
From what I know, this Person class will function seamlessly whether using a strategy of serializing into bytes or through PyO3-supported type conversions of primitive types when exchanging data.
By adding #[pyclass(module="flyrs", get_all)] to every struct and enum via a sed command or something like that, you can expose identical classes and all its members to Python from Rust.
With respect to the Rust coherence rule, be careful when using external crates like prost-types for well-known types such as prost_types::Duration. Inside our crate, we cannot easily impl the external trait from crate pyo3 for external class from another crate prost-types.
(You can learn more details from rustc --explain E0117 and rustc --explain E0210.)
For my understanding, we now have two very different solutions:
- Maintain two identical IDL classes under the hood that are separate for languages, Rust and Python.
- Communication through bytes.
In the earlier sync with Kevin @pingsutw, he raised a concern: Can we really get rid of all IDL objects in Python? If we remove the dependency on protobuf, what would be the size of impact?
Thank you! @vlad-ivanov-name , these suggestions are super helpful. We also need some advice to help us make the best choices in designing that @thomasjpfan and I just discussed. FYR design doc here
With respect to the Rust coherence rule, be careful when using external crates like prost-types for well-known types such as prost_types::Duration.
Thank you for the info! That is good to know.
- Maintain two identical IDL classes under the hood that are separate for languages, Rust and Python.
- Communication through bytes.
Yup, I agree that are the two paths.
Can we really get rid of all IDL objects in Python? If we remove the dependency on protobuf, what would be the size of impact?
The overall size is a minor benefit. For me, getting rid of the protobuf dependency has two major benefits:
- Less dependencies which makes
flytekiteasier to install and less likely to conflict with the user's libraries. Any conflicts with a user's domain libraries withflytekitwill result a worst experience. - If another library uses an older version of protobuf, they will get the following error. This adds leads to a worst user experience from using
flytekit.
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your `protos`, some other possible workarounds are:
1. Downgrade the `protobuf` package to 3.20.x or lower.
2. Set `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` (but this will use pure-Python parsing and will be much slower).
On a technical level going through Bytes and using the Python IDL + protobuf works. Without removing the protobuf dependency, I do not see a major win from the developer experience side. Concretely, here is how I see the pros and cons of a Rust FlyteRemote:
Pros
- Slightly faster with Rust FlyteRemote
- Maybe more memory efficient with Rust FlyteRemote
Cons
- We need to maintain a rust library and all the complexities with it. For example, we'll have to package and compile binary wheels in
flytekit. - If we want to use Python
asyncin the future, then the async story is not great with PyO3.
I can get behind a Rust FlyteRemote if we can point to user facing issues around using Python's grpc library. Personally, I have not come across issues with using grpc from Python.
Thanks for @austin362667 to share the issue with me (and ping me here)
I do enjoy learning the tech design so I talked with Austin for the issue. However, my comments may be produced without enough context and details. And so please feel free to correct my concerns as that is good for my education :)
size is a minor benefit
Sorry for the confusion, my fault for misleading you. By 'size of impact', I mean how much will be affect by removing Python IDL dependencies.
But yeah, I get your points. Thanks for the thoughtful elaborations.
two major benefits pros and cons
Mind letting me add the above analysis to the design doc? I've invited you as an editor, too.
@chia7712 It's really nice of you sharing your adept insights with us here.
Mind letting me add the above analysis to the design doc? I've invited you as an editor, too.
Yup, you can add anything of my comments into your design doc.
@thomasjpfan / @austin362667 / @eapolinario should we just make this a common FlyteRustClient? like in flyteidl and then create the binding layer only in flytekit? This way we will be pure rust in flyteidl that can be imported as a crate? I agree with @thomasjpfan I would love the entire grpc, protobuf dependencies to be rust only. and then we figure out how to make it work with python separately. This way the rust core client can be used in multiple places.
@austin362667 I can't say I'm very familiar with flyte internals - only with some basics - but if you'd like to discuss any of more general rust-related technical questions I would be open to have a call
like in flyteidl and then create the binding layer only in flytekit? This way we will be pure rust in flyteidl that can be imported as a crate?
I agree with having a rust crate with just a RustFlyteRemote and then having a seperate Python library do the binding. I like having a lightweight flytekit-core Python library that exposes the RustFlyteRemote through the Python IDL + protobuf (no flytekit.models). The key benefits are:
flytekitdepends onflytekit-corefor interacting with the remote.flytekititself does not need to worry about packaging rust.- I've seen demand for a lightweight
FlyteRemote. Aflytekit-corewith aRustFlyteRemotewill meet that need.
XREF: Having a "rust core" for a Python library is a fairly common pattern: https://github.com/pydantic/pydantic-core
Currently we have diverged code for existing python and rust from
raw.py,friendly.pytoremote.py. It will be hard to sync latest changes with Python side from Rust side.
Should we add redundant files from flytekit.remote here? Like, backfill.py, data.py, entities.py, executions.py, interface.py, lazy_entity.py, remote_callable.py, remote_fs.py?
Most of them require grpcio, but they are also used by flyrs/remote/remote.py, so I duplicated them here inside flyrs/remote/ for now.
The only way to check that if we get rid of grpcio or not is by ensuring that we can still use FlyteRemoteClient without installing the grpcio package. cc @pingsutw
Or somehow can we achieve this via lazy installation in Python?