protobuf Ruby should embed a serialized descriptor into generated code instead of using the DSL

Currently our generated code for Ruby uses a DSL to specify protos, eg:

Google::Protobuf::DescriptorPool.generated_pool.build do
  add_file("google/protobuf/timestamp.proto", :syntax => :proto3) do
    add_message "google.protobuf.Timestamp" do
      optional :seconds, :int64, 1
      optional :nanos, :int32, 2
    end
  end
end

This provides some readability, but has a high cost: any options that have not been specifically implemented in the DSL are dropped. This includes both built-in options like packed=true (https://github.com/protocolbuffers/protobuf/issues/2294) as well as all custom options(https://github.com/protocolbuffers/protobuf/issues/1198).

To solve these problems, the generated code should embed a serialized descriptor. This will naturally preserve all options, and is easy to implement. eg.

descriptor_data = File.binread(__FILE__).split("\n__END__\n", 2)[1]
Google::Protobuf::DescriptorPool.generated_pool.add_serialized_file(descriptor_data)

__END__

 google/protobuf/descriptor.proto^R^Ogoogle.protobuf"G
^QFileDescriptorSet^R2
^Dfile^X^A ^C(^K2$.google.protobuf.FileDescriptorProto"ì^C
^SFileDescriptorProto^R^L
^Dname^X^A ^A(>-^R^O
^Gpackage^X^B ^A(>------^R^R
<snip>

This does make the generated code less readable though. So we should not do this until we are generating .rbs files: https://github.com/protocolbuffers/protobuf/issues/9495

Then we will have a world where:

_pb.rbs provides a readable description of the protobuf interface
_pb.rb provides a compact, efficient description of the message, optimized for completeness and fast runtime loading.

Sep 01 '22 19:09 haberman

Blocked by: https://github.com/protocolbuffers/protobuf/issues/9495

Sep 01 '22 19:09 haberman

Is it worth unblocking this by ignoring the rbs piece of this problem?

Who is currently reading the generated code and expecting it to be readable?
- They need to understand how the DSL generates ruby classes
- Whereas if they just read the proto directly, they need to understand how it maps to ruby
- In either case, you need to understand some (very simple) mapping.
  - Seems like the proto -> ruby should be the default here. People interacting with the protos need to know proto anyway...

Could RBS be considered orthogonal? Fixing the generated code now, while backsliding on readability (but, per above, is that actually negative?), provides real benefits (like being able to use get proto options, which has been an ask for 6.5 years...).

Readability can then be "fixed" later with RBS.

Edit: Also, fwiw, the DSL also describes / is much closer to the proto itself, not the ruby classes, where as RBS would describe ruby. So the difference between the proto and DSL is super minimal today (imo), meaning I'm not sure there is a readability backslide at all... reading the DSL today is basically reading the proto...

Sep 23 '22 21:09 jplaisted

Fixed in https://github.com/protocolbuffers/protobuf/commit/bd52d0483987f1a5186fc3daa261d1d76a787bcf

Announced in: https://protobuf.dev/news/2023-04-20/

May 02 '23 22:05 haberman