flytekit icon indicating copy to clipboard operation
flytekit copied to clipboard

[WIP] Replace Python gRPC with Rust

Open austin362667 opened this issue 1 year ago • 18 comments

Tracking issue

https://github.com/flyteorg/flyte/issues/5344

Users continuously face weird grpcio bugs, such as version errors, or even being unable to install grpcio.

Why are the changes needed?

To remove the dependency on gRPC in Python and solely replace it with Rust.

What changes were proposed in this pull request?

For details, please check the design doc.

  • [ ] Add no auth Rust client in remote.py.
    • [ ] Task endpoints https://github.com/flyteorg/flytekit/pull/2377
    • [ ] Workflow endpoints
    • [ ] Launchplan endpoints
    • [ ] Other endpoints
  • [ ] Add authentication for Rust client.
    • [ ] PKCE Authenticator
  • [ ] Add Rust CI jobs
    • [ ] Build with gRPC
    • [ ] Build without gRPC
  • [ ] Integration Test
  • [ ] Binding config
    • [ ] abi3-py38

Architecture

FlyteKitRustClient

How was this patch tested?

Right now, only get_task() has been tested for the POC (proof of concept) purpose. There is still a lot of work to be done afterwards.

Setup process

  • Use crate Tonic to compile proto files and generates client service code.
  • Use crate PyO3 to build Rust bindings for Python so we can offloading gRPC to compiled extension.

Screenshots

The performance of FlyteKit slightly benefits from Rust, which remains super amazing.

perf

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [ ] All new and existing tests passed.
  • [X] All commits are signed-off.

Related PRs

https://github.com/flyteorg/flytekit/pull/2307

Docs link

austin362667 avatar Apr 04 '24 16:04 austin362667

I'm a rookie both in Rust and gRPC, so please do not hesitate to give me any kind of advice.

austin362667 avatar Apr 04 '24 16:04 austin362667

Thank you for looking into this!

Here are my high level thoughts for implementing a FlyteRemote in Rust:

  • Do not use flytekit.models. flytekit.models was required when flyte was originally designed, but it is no longer required. The Rust FlyteRemote should return the IDL objects.
    • There are many strategies to deprecating the current flytekit.remote.FlyteRemote. For example, the rust is in flytekit.remote_v2.FlyteRemote and returns IDL objects.
  • How to access the IDL objects from Python
    • My preferred goal is to also remove the dependency on grpc related packages such as googleapis-common-protos and protobuf. (Many issues come from protobuf)
    • This PR's solution is to encode the IDL in rust and decode in Python. I do not like the solution because it requires flyteidl (which requires protobuf) to decode the IDL.
      • On the other hand, I see the external traits + external struct issue. The only workaround I see is to have our own rust macro that creates a custom struct, that wraps the external struct and adds in the pyo3 traits we need. (I am not sure how well this works)
  • We use the the GRPC Python API to create intercepters. It'll likely have to migrate these intercepters to rust as well.
  • There is a plan to move the flytekit repo into flyte, so it'll be easier to manage all the protobufs.

thomasjpfan avatar Apr 04 '24 19:04 thomasjpfan

Thank you so much, Thomas @thomasjpfan . I will discuss with Kevin @pingsutw in two hours.

access the IDL objects from Python

That makes sense to me. Sounds like my original approach. Back then, I didn't know that deprecating the current flytekit.remote.FlyteRemote is in the roadmap. So, if removing Python dependencies not only on gRPC but also on protobuf is the goal, we should eventually achieve that. It's not as aggressive as it may sound because, as you mentioned, we can task a flytekit.remote_v2.FlyteRemote strategy.

external traits + external struct issue

For sure, this is the tricky one. I will definitely look into a customized macro approach. Perhaps It can overcome the Rust coherence rule, or it can't. Also, it's worth mentioning that PyO3 uses get_all and set_all attribute macros for Python ends to access its every member variable nestedly all the way down to primitive types for all high-level abstracted structs. And we need a new() method for each class so that we can construct from Python, also there's the __repr__ stuff to consider.

manage all the protobufs

So, I should modify this buf/generate stuff to better generate Rust service code for future usage. The current version has no client code like AdminServiceClient. It's a more unified way to manage proto files in one single place. Alternatively, should I compile FlyteIDL proto files to start over in generating more complete Rust service code within the FlyteRS repository?

austin362667 avatar Apr 05 '24 01:04 austin362667

this is awesome but @vlad-ivanov-name seems to have a good grasp. Can you please help review and help us drive a best choices in designing.

@austin362667 also I have a PR https://github.com/flyteorg/flytekit/pull/2307

kumare3 avatar Apr 05 '24 04:04 kumare3

The current version has no client code like AdminServiceClient. It's a more unified way to manage proto files in one single place. Alternatively, should I compile FlyteIDL proto files to start over in generating more complete Rust service code within the FlyteRS repository?

If there is a way to get buf to only generate service code, then I like that solution. Otherwise, I am okay with the current solution of using https://github.com/tokio-rs/prost.

For sure, this is the tricky one. I will definitely look into a customized macro approach. Perhaps It can overcome the Rust coherence rule, or it can't.

To answer this question, I would use this example protobuf and try different techniques for exposing it into Python. I'll start with making sure the Person works by itself and then work in the other objects:

message Person {
  string name = 1;
  int32 id = 2;  // Unique ID number for this person.
  string email = 3;
}

I think this Rust<->Python data exchange is the unknown technical part of this work. I know removing our dependency on the Python protobuf library makes this task harder, but protobuf is what causes the most issues. For me, removing the dependency has more impact on the user experience than the speed gains.

thomasjpfan avatar Apr 05 '24 13:04 thomasjpfan

I think figuring out the Rust<->Python data exchange and writing a Rust's FlyteClient can be two separate task:

  1. The Rust FlyteClient is a pure rust library that implements and API required for FlyteRemote. This pure rust library does not depend on pyo3 and does not worry about the Python binding. I find that having a pure rust library that does not depend on pyo3 makes it easier to test in rust.
  2. A second rust library that depends on pyo3 and binds the Rust FlyteClient with Python. All this library cares about is binding.

I feel like 2. is the unknown part right now, so I am pushing to figuring that out first.

thomasjpfan avatar Apr 05 '24 13:04 thomasjpfan

I think something's up with github as it first duplicated my review, and after I hid one duplicate the second disappeared 🙈 sorry about that, you can expand the comment above

vlad-ivanov-name avatar Apr 05 '24 15:04 vlad-ivanov-name

get buf to only generate service code

I think @eapolinario resolved this issue with PR https://github.com/flyteorg/flyte/pull/5187.

removing the dependency has more impact on the user experience than the speed gains

I totally agree with this objective.

makes this task harder, but protobuf is what causes the most issues

Kafka PMC @chia7712 is interested in the purpose of these refactorings. In his experience, projects with multiple languages often struggle, making consistency maintenance difficult. For example, Integrating Java with C/C++ can cause memory leaks, and Java with Scala may lead to binary compatibility issues. This complexity extends to build logic, requiring various tools and additional manpower for maintenance.

Rust<->Python data exchange

From what I know, this Person class will function seamlessly whether using a strategy of serializing into bytes or through PyO3-supported type conversions of primitive types when exchanging data. By adding #[pyclass(module="flyrs", get_all)] to every struct and enum via a sed command or something like that, you can expose identical classes and all its members to Python from Rust.

With respect to the Rust coherence rule, be careful when using external crates like prost-types for well-known types such as prost_types::Duration. Inside our crate, we cannot easily impl the external trait from crate pyo3 for external class from another crate prost-types. (You can learn more details from rustc --explain E0117 and rustc --explain E0210.)

For my understanding, we now have two very different solutions:

  1. Maintain two identical IDL classes under the hood that are separate for languages, Rust and Python.
  2. Communication through bytes.

In the earlier sync with Kevin @pingsutw, he raised a concern: Can we really get rid of all IDL objects in Python? If we remove the dependency on protobuf, what would be the size of impact?

austin362667 avatar Apr 05 '24 15:04 austin362667

Thank you! @vlad-ivanov-name , these suggestions are super helpful. We also need some advice to help us make the best choices in designing that @thomasjpfan and I just discussed. FYR design doc here

austin362667 avatar Apr 05 '24 15:04 austin362667

With respect to the Rust coherence rule, be careful when using external crates like prost-types for well-known types such as prost_types::Duration.

Thank you for the info! That is good to know.

  1. Maintain two identical IDL classes under the hood that are separate for languages, Rust and Python.
  2. Communication through bytes.

Yup, I agree that are the two paths.

Can we really get rid of all IDL objects in Python? If we remove the dependency on protobuf, what would be the size of impact?

The overall size is a minor benefit. For me, getting rid of the protobuf dependency has two major benefits:

  1. Less dependencies which makes flytekit easier to install and less likely to conflict with the user's libraries. Any conflicts with a user's domain libraries with flytekit will result a worst experience.
  2. If another library uses an older version of protobuf, they will get the following error. This adds leads to a worst user experience from using flytekit.
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your `protos`, some other possible workarounds are:
 1. Downgrade the `protobuf` package to 3.20.x or lower.
 2. Set `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` (but this will use pure-Python parsing and will be much slower).

On a technical level going through Bytes and using the Python IDL + protobuf works. Without removing the protobuf dependency, I do not see a major win from the developer experience side. Concretely, here is how I see the pros and cons of a Rust FlyteRemote:

Pros

  1. Slightly faster with Rust FlyteRemote
  2. Maybe more memory efficient with Rust FlyteRemote

Cons

  1. We need to maintain a rust library and all the complexities with it. For example, we'll have to package and compile binary wheels in flytekit.
  2. If we want to use Python async in the future, then the async story is not great with PyO3.

I can get behind a Rust FlyteRemote if we can point to user facing issues around using Python's grpc library. Personally, I have not come across issues with using grpc from Python.

thomasjpfan avatar Apr 05 '24 16:04 thomasjpfan

Thanks for @austin362667 to share the issue with me (and ping me here)

I do enjoy learning the tech design so I talked with Austin for the issue. However, my comments may be produced without enough context and details. And so please feel free to correct my concerns as that is good for my education :)

chia7712 avatar Apr 05 '24 16:04 chia7712

size is a minor benefit

Sorry for the confusion, my fault for misleading you. By 'size of impact', I mean how much will be affect by removing Python IDL dependencies.

But yeah, I get your points. Thanks for the thoughtful elaborations.

two major benefits pros and cons

Mind letting me add the above analysis to the design doc? I've invited you as an editor, too.

austin362667 avatar Apr 05 '24 17:04 austin362667

@chia7712 It's really nice of you sharing your adept insights with us here.

austin362667 avatar Apr 05 '24 17:04 austin362667

Mind letting me add the above analysis to the design doc? I've invited you as an editor, too.

Yup, you can add anything of my comments into your design doc.

thomasjpfan avatar Apr 05 '24 18:04 thomasjpfan

@thomasjpfan / @austin362667 / @eapolinario should we just make this a common FlyteRustClient? like in flyteidl and then create the binding layer only in flytekit? This way we will be pure rust in flyteidl that can be imported as a crate? I agree with @thomasjpfan I would love the entire grpc, protobuf dependencies to be rust only. and then we figure out how to make it work with python separately. This way the rust core client can be used in multiple places.

kumare3 avatar Apr 05 '24 23:04 kumare3

@austin362667 I can't say I'm very familiar with flyte internals - only with some basics - but if you'd like to discuss any of more general rust-related technical questions I would be open to have a call

vlad-ivanov-name avatar Apr 06 '24 14:04 vlad-ivanov-name

like in flyteidl and then create the binding layer only in flytekit? This way we will be pure rust in flyteidl that can be imported as a crate?

I agree with having a rust crate with just a RustFlyteRemote and then having a seperate Python library do the binding. I like having a lightweight flytekit-core Python library that exposes the RustFlyteRemote through the Python IDL + protobuf (no flytekit.models). The key benefits are:

  • flytekit depends on flytekit-core for interacting with the remote. flytekit itself does not need to worry about packaging rust.
  • I've seen demand for a lightweight FlyteRemote. A flytekit-core with a RustFlyteRemote will meet that need.

XREF: Having a "rust core" for a Python library is a fairly common pattern: https://github.com/pydantic/pydantic-core

thomasjpfan avatar Apr 09 '24 14:04 thomasjpfan

Currently we have diverged code for existing python and rust from raw.py, friendly.py to remote.py. It will be hard to sync latest changes with Python side from Rust side.

Should we add redundant files from flytekit.remote here? Like, backfill.py, data.py, entities.py, executions.py, interface.py, lazy_entity.py, remote_callable.py, remote_fs.py?

Most of them require grpcio, but they are also used by flyrs/remote/remote.py, so I duplicated them here inside flyrs/remote/ for now.

The only way to check that if we get rid of grpcio or not is by ensuring that we can still use FlyteRemoteClient without installing the grpcio package. cc @pingsutw

Or somehow can we achieve this via lazy installation in Python?

austin362667 avatar Apr 15 '24 06:04 austin362667