proto: introduce seastar friendly protogen
This PR introduces a new protoc plugin that can generate protos that are "seastar friendly", meaning they are not prone to oversized allocations and have serializations that yield control of the CPU when there is IO pending. The plugin is written in golang and lives in bazel/pbgen/main.go using the offical protobuf library (which includes a nice reflect library to write plugins in).
There is also now a set of Bazel rules to do said generation of classes, and serde routines to convert protos into iobufs (and back).
Usage looks like this:
load("@rules_proto//proto:defs.bzl", "proto_library")
load("//bazel/pbgen:pbgen.bzl", "redpanda_proto_library")
proto_library(
name = "version_proto",
srcs = ["version.proto"],
visibility = ["//visibility:public"],
deps = ["//proto/redpanda/pbgen:options_proto"],
)
redpanda_proto_library(
name = "version_redpanda_proto",
protos = [":version_proto"],
visibility = ["//visibility:public"],
)
The proto_library is the default rules_proto way of compiling protos to descriptor sets, and redpanda_proto_library generates our seastar friendly code. Note that redpanda_proto_library requires you passing in other redpanda_proto_library rules that it depends on.
The API of a generated proto is fairly similar to the offical protobuf library outputs, for the following proto (dependencies omitted):
/**
* Represents a person with various fields, nested messages, and enums.
*/
message Person {
// Unique identifier for the person
int32 id = 1;
// Full name
string name = 2;
// Email address
string email = 3;
// List of phone numbers
repeated PhoneNumber phones = 4;
// Gender enum
Gender gender = 5;
// Address — oneof home or work
oneof address {
string home_address = 6;
string work_address = 7;
}
// Metadata (string → string map)
map<string, string> metadata = 8;
// Nested enum
enum Gender {
GENDER_UNSPECIFIED = 0;
MALE = 1;
FEMALE = 2;
NON_BINARY = 3;
}
// Nested message
message PhoneNumber {
string number = 1;
PhoneType type = 2;
// Nested enum inside nested message
enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}
}
}
Has a generated class of:
// Represents a person with various fields, nested messages, and enums.
class person {
public:
// Serializes rich.example.Person into a protocol buffer.
iobuf to_proto() const;
// Serializes rich.example.Person into proto3 JSON.
iobuf to_json() const;
// Serializes rich.example.Person into a protocol buffer, in a way that will not cause stalls for large messages.
seastar::future<iobuf> to_proto_async() const;
// Serializes rich.example.Person into proto3 JSON, in a way that will not cause stalls for large messages.
seastar::future<iobuf> to_json_async() const;
// Deserializes rich.example.Person from a protocol buffer.
static void from_proto(iobuf_parser*, person*);
static person from_proto(iobuf_parser*);
// Deserializes rich.example.Person from a protocol buffer, in a way that will not cause stalls for large messages.
static seastar::future<> from_proto_async(iobuf_parser*, person*);
static seastar::future<person> from_proto_async(iobuf_parser*);
// Deserializes rich.example.Person from JSON.
static void from_json(iobuf_parser*, person*);
static person from_json(iobuf_parser*);
// Deserializes rich.example.Person from JSON, in a way that will not cause stalls for large messages.
static seastar::future<> from_json_async(iobuf_parser*, person*);
static seastar::future<person> from_json_async(iobuf_parser*);
bool operator==(const person&) const = default;
// Unique identifier for the person
int32_t get_id() const { return id_; }
void set_id(int32_t v) { id_ = v; }
// Full name
ss::sstring& get_name() { return name_; }
const ss::sstring& get_name() const { return name_; }
void set_name(ss::sstring&& v) { name_ = std::move(v); }
// Email address
ss::sstring& get_email() { return email_; }
const ss::sstring& get_email() const { return email_; }
void set_email(ss::sstring&& v) { email_ = std::move(v); }
// List of phone numbers
chunked_vector<person_phone_number>& get_phones() { return phones_; }
const chunked_vector<person_phone_number>& get_phones() const { return phones_; }
void set_phones(chunked_vector<person_phone_number>&& v) { phones_ = std::move(v); }
// Gender enum
person_gender get_gender() const { return gender_; }
void set_gender(person_gender v) { gender_ = v; }
// Address — oneof home or work
template<typename Self, typename... Args>
auto&& visit_address(this Self&& self, Args&&... args) { seastar::visit(std::forward<Self>(self).address_, std::forward<Args>(args)...); }
bool has_home_address() const { return address_.index() == 1; }
ss::sstring& get_home_address() { return std::get<1>(address_); }
const ss::sstring& get_home_address() const { return std::get<1>(address_); }
void set_home_address(ss::sstring&& v) { address_.emplace<1>(std::move(v)); }
bool has_work_address() const { return address_.index() == 2; }
ss::sstring& get_work_address() { return std::get<2>(address_); }
const ss::sstring& get_work_address() const { return std::get<2>(address_); }
void set_work_address(ss::sstring&& v) { address_.emplace<2>(std::move(v)); }
// Metadata (string → string map)
chunked_hash_map<ss::sstring, ss::sstring>& get_metadata() { return metadata_; }
const chunked_hash_map<ss::sstring, ss::sstring>& get_metadata() const { return metadata_; }
void set_metadata(chunked_hash_map<ss::sstring, ss::sstring>&& v) { metadata_ = std::move(v); }
private:
int32_t id_{};
ss::sstring name_{};
ss::sstring email_{};
chunked_vector<person_phone_number> phones_{};
person_gender gender_{};
std::variant<std::monostate, ss::sstring, ss::sstring> address_{};
chunked_hash_map<ss::sstring, ss::sstring> metadata_{};
};
The motivation behind this is to make the Admin API much easier to use. In support of this, protobuf services and RPC methods are supported, generating helper classes that handle all the serialization of RPC requests and responses. The protocol used for the generated protos is ConnectRPC, which is already used extensively by the Console and Cloud teams, the protocol is much simpler than gRPC and doesn't require HTTP2 support. Note that we only support unary RPCs for now, streaming RPCs if needed could have support added for them.
There is not yet an example of defining a new Admin API, but that will be plumbed and demonstrated before this PR is ready for review.
NOTE: Missing from this PR is support for parsing JSON and JSON Content-Types in the generated RPC code, that is dependant on https://github.com/redpanda-data/redpanda/pull/26019
Backports Required
- [x] none - not a bug fix
- [ ] none - this is a backport
- [ ] none - issue does not exist in previous branches
- [ ] none - papercut/not impactful enough to backport
- [ ] v25.1.x
- [ ] v24.3.x
- [ ] v24.2.x
Release Notes
- none
👏 👏 👏
this is gonna open up some dope use cases
CI test results
test results on build#67986
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| RaftAvailabilityTest | test_one_node_down | ducktape | https://buildkite.com/redpanda/redpanda/builds/67986#0197ad26-9dc7-4839-93ff-0b807d2cf5e5 | FLAKY | 18/21 | upstream reliability is '95.60723514211887'. current run reliability is '85.71428571428571'. drift is 9.89295 and the allowed drift is set to 50. The test should PASS | |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67986#0197ad26-9dc2-4b6d-8ddf-422b0872d432 | FLAKY | 19/21 | upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS |
test results on build#68026
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| CloudStorageScrubberTest | test_scrubber | {"cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68026#0197ae28-a863-4b89-89f4-9a176c9a967c | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
test results on build#68194
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| DisablingPartitionsTest | test_disable | ducktape | https://buildkite.com/redpanda/redpanda/builds/68194#0197c1ce-731c-4b54-927f-3d9615d6c6f9 | FLAKY | 17/21 | upstream reliability is '92.1965317919075'. current run reliability is '80.95238095238095'. drift is 11.24415 and the allowed drift is set to 50. The test should PASS |
test results on build#68778
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68778#0197f657-0c62-470d-af1f-95519a010e41 | FLAKY | 19/21 | upstream reliability is '98.6449864498645'. current run reliability is '90.47619047619048'. drift is 8.1688 and the allowed drift is set to 50. The test should PASS |
test results on build#68840
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| NodePostRestartProbeTest | post_restart_probe_test | ducktape | https://buildkite.com/redpanda/redpanda/builds/68840#0197fa7a-c35c-4faf-8322-b9e9396d8cdf | FLAKY | 20/21 | upstream reliability is '99.53917050691244'. current run reliability is '95.23809523809523'. drift is 4.30108 and the allowed drift is set to 50. The test should PASS | |
| NodeDecommissionSpaceManagementTest | test_decommission | {"single_partition": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68840#0197fa7a-c363-41cf-be90-0be1703b1286 | FLAKY | 19/21 | upstream reliability is '99.07407407407408'. current run reliability is '90.47619047619048'. drift is 8.59788 and the allowed drift is set to 50. The test should PASS |
| RaftAvailabilityTest | test_controller_node_isolation | ducktape | https://buildkite.com/redpanda/redpanda/builds/68840#0197fa7a-c357-4cc9-849c-14a051e9fd61 | FLAKY | 20/21 | upstream reliability is '98.57142857142858'. current run reliability is '95.23809523809523'. drift is 3.33333 and the allowed drift is set to 50. The test should PASS |
test results on build#68925
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| DatalakeThrottlingTest | test_basic_throttling | {"catalog_type": "nessie", "cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68925#01980a0c-1cae-4d19-ab7d-bd49600f3b68 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/68925#01980a0d-82fc-4d9e-a36f-316b45a955be | FLAKY | 20/21 | upstream reliability is '98.87218045112782'. current run reliability is '95.23809523809523'. drift is 3.63409 and the allowed drift is set to 50. The test should PASS |
test results on build#69006
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| MaintenanceTest | test_maintenance_sticky | {"use_rpk": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/69006#01980e8c-43c9-4e43-9461-c10d49b450ff | FLAKY | 20/21 | upstream reliability is '98.18181818181819'. current run reliability is '95.23809523809523'. drift is 2.94372 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "adjacent_merge", "enable_failures": true, "mixed_versions": false, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/69006#01980e8c-43ca-4b9d-9ae2-1c9049be2edb | FLAKY | 19/21 | upstream reliability is '99.44444444444444'. current run reliability is '90.47619047619048'. drift is 8.96825 and the allowed drift is set to 50. The test should PASS |
test results on build#69119
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/69119#019814d0-ed07-454d-92ac-b3e4e378d9ff | FLAKY | 19/21 | upstream reliability is '98.40637450199203'. current run reliability is '90.47619047619048'. drift is 7.93018 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/69119#019814d0-ed0c-4d0a-b96e-089aed89cc12 | FLAKY | 20/21 | upstream reliability is '98.39357429718876'. current run reliability is '95.23809523809523'. drift is 3.15548 and the allowed drift is set to 50. The test should PASS |
seems like some real coverage gaps in the parser?
in the parser?
What aspect of parsing do you mean? The parser has decent coverage from the iceberg dynamic reading of protos approach. The JSON support could certainly have more coverage. Some of the lines missing coverage are likely because we don't check weird map keys and there are things like NaN, Inf we need to add support for
Let me add our conformance protobuf to the round_trip_test which should bump up coverage
What aspect of parsing do you mean?
Oh I was mistaken I thought when I had looked earlier that the parser was 100% new in this PR. Indeed it is not.
The parser could have better coverage tho, I'll look at improving that.
I added the conformance proto to the manually serialized code and caught a bunch of issues.
I also removed the non-async versions of parsing/serializing. Maybe we want those back someday, but for now I will remove. The conformance proto was taking a minute to compile locally with all the duplicated code between the async and non async versions.
seems like some real coverage gaps in the parser?
Post conformance testing:
How far do we expect these generated structs to reach inside redpanda codebase? Will we require in any way that they don't go say past RPC layer? If no, then I'd like to request to consider splitting the struct generation from serde and rpc code (mostly concerned about includes).
I've tried to keep includes to be a minimal set depending on what proto features are used. Currently there are no plans to use these outside the admin layer, but that may change longer term. I think we should be splitting proto files so they are minimal. IE "core" proto files should be separate from the RPC only protos.
I believe most people will just cargo cult here, and I'll hopefully setup some good patterns in a followup when we plug this all into admin server.
Okay I've simplified the codegen a great deal and I'm keen to get this in to not let it bit rot.
Cleaned up the commit history if that helps
Force push: reworded some commit messages
