redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

proto: introduce seastar friendly protogen

Open rockwotj opened this issue 7 months ago • 2 comments

This PR introduces a new protoc plugin that can generate protos that are "seastar friendly", meaning they are not prone to oversized allocations and have serializations that yield control of the CPU when there is IO pending. The plugin is written in golang and lives in bazel/pbgen/main.go using the offical protobuf library (which includes a nice reflect library to write plugins in).

There is also now a set of Bazel rules to do said generation of classes, and serde routines to convert protos into iobufs (and back).

Usage looks like this:

load("@rules_proto//proto:defs.bzl", "proto_library")
load("//bazel/pbgen:pbgen.bzl", "redpanda_proto_library")

proto_library(
    name = "version_proto",
    srcs = ["version.proto"],
    visibility = ["//visibility:public"],
    deps = ["//proto/redpanda/pbgen:options_proto"],
)

redpanda_proto_library(
    name = "version_redpanda_proto",
    protos = [":version_proto"],
    visibility = ["//visibility:public"],
)

The proto_library is the default rules_proto way of compiling protos to descriptor sets, and redpanda_proto_library generates our seastar friendly code. Note that redpanda_proto_library requires you passing in other redpanda_proto_library rules that it depends on.

The API of a generated proto is fairly similar to the offical protobuf library outputs, for the following proto (dependencies omitted):

/**
 * Represents a person with various fields, nested messages, and enums.
 */
message Person {
  // Unique identifier for the person
  int32 id = 1;

  // Full name
  string name = 2;

  // Email address
  string email = 3;

  // List of phone numbers
  repeated PhoneNumber phones = 4;

  // Gender enum
  Gender gender = 5;

  // Address — oneof home or work
  oneof address {
    string home_address = 6;
    string work_address = 7;
  }

  // Metadata (string → string map)
  map<string, string> metadata = 8;

  // Nested enum
  enum Gender {
    GENDER_UNSPECIFIED = 0;
    MALE = 1;
    FEMALE = 2;
    NON_BINARY = 3;
  }

  // Nested message
  message PhoneNumber {
    string number = 1;
    PhoneType type = 2;

    // Nested enum inside nested message
    enum PhoneType {
      MOBILE = 0;
      HOME = 1;
      WORK = 2;
    }
  }
}

Has a generated class of:

// Represents a person with various fields, nested messages, and enums.
class person {
public:
  // Serializes rich.example.Person into a protocol buffer.
  iobuf to_proto() const;
  // Serializes rich.example.Person into proto3 JSON.
  iobuf to_json() const;
  // Serializes rich.example.Person into a protocol buffer, in a way that will not cause stalls for large messages.
  seastar::future<iobuf> to_proto_async() const;
  // Serializes rich.example.Person into proto3 JSON, in a way that will not cause stalls for large messages.
  seastar::future<iobuf> to_json_async() const;
  // Deserializes rich.example.Person from a protocol buffer.
  static void from_proto(iobuf_parser*, person*);
  static person from_proto(iobuf_parser*);
  // Deserializes rich.example.Person from a protocol buffer, in a way that will not cause stalls for large messages.
  static seastar::future<> from_proto_async(iobuf_parser*, person*);
  static seastar::future<person> from_proto_async(iobuf_parser*);
  // Deserializes rich.example.Person from JSON.
  static void from_json(iobuf_parser*, person*);
  static person from_json(iobuf_parser*);
  // Deserializes rich.example.Person from JSON, in a way that will not cause stalls for large messages.
  static seastar::future<> from_json_async(iobuf_parser*, person*);
  static seastar::future<person> from_json_async(iobuf_parser*);
  
  bool operator==(const person&) const = default;  
  // Unique identifier for the person
  int32_t get_id() const { return id_; }
  void set_id(int32_t v) { id_ = v; }
  // Full name
  ss::sstring& get_name() { return name_; }
  const ss::sstring& get_name() const { return name_; }
  void set_name(ss::sstring&& v) { name_ = std::move(v); }
  // Email address
  ss::sstring& get_email() { return email_; }
  const ss::sstring& get_email() const { return email_; }
  void set_email(ss::sstring&& v) { email_ = std::move(v); }
  // List of phone numbers
  chunked_vector<person_phone_number>& get_phones() { return phones_; }
  const chunked_vector<person_phone_number>& get_phones() const { return phones_; }
  void set_phones(chunked_vector<person_phone_number>&& v) { phones_ = std::move(v); }
  // Gender enum
  person_gender get_gender() const { return gender_; }
  void set_gender(person_gender v) { gender_ = v; }
  // Address — oneof home or work
  template<typename Self, typename... Args>
  auto&& visit_address(this Self&& self, Args&&... args) { seastar::visit(std::forward<Self>(self).address_, std::forward<Args>(args)...); }
  bool has_home_address() const { return address_.index() == 1; }
  ss::sstring& get_home_address() { return std::get<1>(address_); }
  const ss::sstring& get_home_address() const { return std::get<1>(address_); }
  void set_home_address(ss::sstring&& v) { address_.emplace<1>(std::move(v)); }
  bool has_work_address() const { return address_.index() == 2; }
  ss::sstring& get_work_address() { return std::get<2>(address_); }
  const ss::sstring& get_work_address() const { return std::get<2>(address_); }
  void set_work_address(ss::sstring&& v) { address_.emplace<2>(std::move(v)); }
  // Metadata (string → string map)
  chunked_hash_map<ss::sstring, ss::sstring>& get_metadata() { return metadata_; }
  const chunked_hash_map<ss::sstring, ss::sstring>& get_metadata() const { return metadata_; }
  void set_metadata(chunked_hash_map<ss::sstring, ss::sstring>&& v) { metadata_ = std::move(v); }

private:
  int32_t id_{};
  ss::sstring name_{};
  ss::sstring email_{};
  chunked_vector<person_phone_number> phones_{};
  person_gender gender_{};
  std::variant<std::monostate, ss::sstring, ss::sstring> address_{};
  chunked_hash_map<ss::sstring, ss::sstring> metadata_{};
};

The motivation behind this is to make the Admin API much easier to use. In support of this, protobuf services and RPC methods are supported, generating helper classes that handle all the serialization of RPC requests and responses. The protocol used for the generated protos is ConnectRPC, which is already used extensively by the Console and Cloud teams, the protocol is much simpler than gRPC and doesn't require HTTP2 support. Note that we only support unary RPCs for now, streaming RPCs if needed could have support added for them.

There is not yet an example of defining a new Admin API, but that will be plumbed and demonstrated before this PR is ready for review.

NOTE: Missing from this PR is support for parsing JSON and JSON Content-Types in the generated RPC code, that is dependant on https://github.com/redpanda-data/redpanda/pull/26019

Backports Required

  • [x] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [ ] none - papercut/not impactful enough to backport
  • [ ] v25.1.x
  • [ ] v24.3.x
  • [ ] v24.2.x

Release Notes

  • none

rockwotj avatar Jun 13 '25 16:06 rockwotj

👏 👏 👏

dotnwat avatar Jun 13 '25 17:06 dotnwat

this is gonna open up some dope use cases

dotnwat avatar Jun 13 '25 17:06 dotnwat

CI test results

test results on build#67986
test_class test_method test_arguments test_kind job_url test_status passed reason
RaftAvailabilityTest test_one_node_down ducktape https://buildkite.com/redpanda/redpanda/builds/67986#0197ad26-9dc7-4839-93ff-0b807d2cf5e5 FLAKY 18/21 upstream reliability is '95.60723514211887'. current run reliability is '85.71428571428571'. drift is 9.89295 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} ducktape https://buildkite.com/redpanda/redpanda/builds/67986#0197ad26-9dc2-4b6d-8ddf-422b0872d432 FLAKY 19/21 upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS
test results on build#68026
test_class test_method test_arguments test_kind job_url test_status passed reason
CloudStorageScrubberTest test_scrubber {"cloud_storage_type": 1} ducktape https://buildkite.com/redpanda/redpanda/builds/68026#0197ae28-a863-4b89-89f4-9a176c9a967c FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
test results on build#68194
test_class test_method test_arguments test_kind job_url test_status passed reason
DisablingPartitionsTest test_disable ducktape https://buildkite.com/redpanda/redpanda/builds/68194#0197c1ce-731c-4b54-927f-3d9615d6c6f9 FLAKY 17/21 upstream reliability is '92.1965317919075'. current run reliability is '80.95238095238095'. drift is 11.24415 and the allowed drift is set to 50. The test should PASS
test results on build#68778
test_class test_method test_arguments test_kind job_url test_status passed reason
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} ducktape https://buildkite.com/redpanda/redpanda/builds/68778#0197f657-0c62-470d-af1f-95519a010e41 FLAKY 19/21 upstream reliability is '98.6449864498645'. current run reliability is '90.47619047619048'. drift is 8.1688 and the allowed drift is set to 50. The test should PASS
test results on build#68840
test_class test_method test_arguments test_kind job_url test_status passed reason
NodePostRestartProbeTest post_restart_probe_test ducktape https://buildkite.com/redpanda/redpanda/builds/68840#0197fa7a-c35c-4faf-8322-b9e9396d8cdf FLAKY 20/21 upstream reliability is '99.53917050691244'. current run reliability is '95.23809523809523'. drift is 4.30108 and the allowed drift is set to 50. The test should PASS
NodeDecommissionSpaceManagementTest test_decommission {"single_partition": true} ducktape https://buildkite.com/redpanda/redpanda/builds/68840#0197fa7a-c363-41cf-be90-0be1703b1286 FLAKY 19/21 upstream reliability is '99.07407407407408'. current run reliability is '90.47619047619048'. drift is 8.59788 and the allowed drift is set to 50. The test should PASS
RaftAvailabilityTest test_controller_node_isolation ducktape https://buildkite.com/redpanda/redpanda/builds/68840#0197fa7a-c357-4cc9-849c-14a051e9fd61 FLAKY 20/21 upstream reliability is '98.57142857142858'. current run reliability is '95.23809523809523'. drift is 3.33333 and the allowed drift is set to 50. The test should PASS
test results on build#68925
test_class test_method test_arguments test_kind job_url test_status passed reason
DatalakeThrottlingTest test_basic_throttling {"catalog_type": "nessie", "cloud_storage_type": 1} ducktape https://buildkite.com/redpanda/redpanda/builds/68925#01980a0c-1cae-4d19-ab7d-bd49600f3b68 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} ducktape https://buildkite.com/redpanda/redpanda/builds/68925#01980a0d-82fc-4d9e-a36f-316b45a955be FLAKY 20/21 upstream reliability is '98.87218045112782'. current run reliability is '95.23809523809523'. drift is 3.63409 and the allowed drift is set to 50. The test should PASS
test results on build#69006
test_class test_method test_arguments test_kind job_url test_status passed reason
MaintenanceTest test_maintenance_sticky {"use_rpk": false} ducktape https://buildkite.com/redpanda/redpanda/builds/69006#01980e8c-43c9-4e43-9461-c10d49b450ff FLAKY 20/21 upstream reliability is '98.18181818181819'. current run reliability is '95.23809523809523'. drift is 2.94372 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "adjacent_merge", "enable_failures": true, "mixed_versions": false, "with_iceberg": false} ducktape https://buildkite.com/redpanda/redpanda/builds/69006#01980e8c-43ca-4b9d-9ae2-1c9049be2edb FLAKY 19/21 upstream reliability is '99.44444444444444'. current run reliability is '90.47619047619048'. drift is 8.96825 and the allowed drift is set to 50. The test should PASS
test results on build#69119
test_class test_method test_arguments test_kind job_url test_status passed reason
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": false} ducktape https://buildkite.com/redpanda/redpanda/builds/69119#019814d0-ed07-454d-92ac-b3e4e378d9ff FLAKY 19/21 upstream reliability is '98.40637450199203'. current run reliability is '90.47619047619048'. drift is 7.93018 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": true} ducktape https://buildkite.com/redpanda/redpanda/builds/69119#019814d0-ed0c-4d0a-b96e-089aed89cc12 FLAKY 20/21 upstream reliability is '98.39357429718876'. current run reliability is '95.23809523809523'. drift is 3.15548 and the allowed drift is set to 50. The test should PASS

vbotbuildovich avatar Jun 26 '25 19:06 vbotbuildovich

seems like some real coverage gaps in the parser?

image

dotnwat avatar Jun 27 '25 16:06 dotnwat

in the parser?

What aspect of parsing do you mean? The parser has decent coverage from the iceberg dynamic reading of protos approach. The JSON support could certainly have more coverage. Some of the lines missing coverage are likely because we don't check weird map keys and there are things like NaN, Inf we need to add support for

rockwotj avatar Jun 27 '25 16:06 rockwotj

Let me add our conformance protobuf to the round_trip_test which should bump up coverage

rockwotj avatar Jun 27 '25 16:06 rockwotj

What aspect of parsing do you mean?

Oh I was mistaken I thought when I had looked earlier that the parser was 100% new in this PR. Indeed it is not.

dotnwat avatar Jun 27 '25 17:06 dotnwat

The parser could have better coverage tho, I'll look at improving that.

rockwotj avatar Jun 27 '25 20:06 rockwotj

I added the conformance proto to the manually serialized code and caught a bunch of issues.

rockwotj avatar Jun 27 '25 20:06 rockwotj

I also removed the non-async versions of parsing/serializing. Maybe we want those back someday, but for now I will remove. The conformance proto was taking a minute to compile locally with all the duplicated code between the async and non async versions.

rockwotj avatar Jun 27 '25 20:06 rockwotj

seems like some real coverage gaps in the parser? image

Post conformance testing:

Screenshot 2025-06-27 at 3 27 17 PM

rockwotj avatar Jun 27 '25 20:06 rockwotj

How far do we expect these generated structs to reach inside redpanda codebase? Will we require in any way that they don't go say past RPC layer? If no, then I'd like to request to consider splitting the struct generation from serde and rpc code (mostly concerned about includes).

I've tried to keep includes to be a minimal set depending on what proto features are used. Currently there are no plans to use these outside the admin layer, but that may change longer term. I think we should be splitting proto files so they are minimal. IE "core" proto files should be separate from the RPC only protos.

I believe most people will just cargo cult here, and I'll hopefully setup some good patterns in a followup when we plug this all into admin server.

rockwotj avatar Jun 30 '25 16:06 rockwotj

Okay I've simplified the codegen a great deal and I'm keen to get this in to not let it bit rot.

rockwotj avatar Jul 10 '25 20:07 rockwotj

Cleaned up the commit history if that helps

rockwotj avatar Jul 15 '25 13:07 rockwotj

Force push: reworded some commit messages

rockwotj avatar Jul 15 '25 14:07 rockwotj