
[protobuf2, protobuf3] Text Format Language Specification is not implemented correctly for protobuf2 or protobuf3.

Open kaby76 opened this issue 3 months ago • 1 comments

The specs for proto2 and proto3 both state that MessageValue is defined in the [Text Format Language Specification](https://protobuf.dev/reference/protobuf/textformat-spec#fields). Both specs use MessageValue in the rule for constant:

constant = fullIdent | ( [ "-" | "+" ] intLit ) | ( [ "-" | "+" ] floatLit ) |
                strLit | boolLit | MessageValue

The Text Format Language Specification (https://protobuf.dev/reference/protobuf/textformat-spec/) is an entirely separate spec for a different language, although that language is only used with Protocol Buffers. The EBNF for the language is given in the Text Format Language spec (https://protobuf.dev/reference/protobuf/textformat-spec/#fields). NB: As noted in the spec, the EBNF was reverse-engineered from the C++ parser implementation. We know where this can lead us...

Field        = ScalarField | MessageField ;
MessageField = FieldName, [ ":" ], ( MessageValue | MessageList ) [ ";" | "," ];
ScalarField  = FieldName, ":",     ( ScalarValue  | ScalarList  ) [ ";" | "," ];
MessageList  = "[", [ MessageValue, { ",", MessageValue } ], "]" ;
ScalarList   = "[", [ ScalarValue,  { ",", ScalarValue  } ], "]" ;
MessageValue = "{", Message, "}" | "<", Message, ">" ;
ScalarValue  = String
             | Float
             | Identifier
             | SignedIdentifier
             | DecSignedInteger
             | OctSignedInteger
             | HexSignedInteger
             | DecUnsignedInteger
             | OctUnsignedInteger
             | HexUnsignedInteger ;

Unfortunately, none of this corresponds to the rules in the Antlr grammars:

https://github.com/antlr/grammars-v4/blob/6b517735620223475eefaa85c92f8d6bce15f360/protobuf/protobuf2/Protobuf2.g4#L267-L270

https://github.com/antlr/grammars-v4/blob/6b517735620223475eefaa85c92f8d6bce15f360/protobuf/protobuf3/Protobuf3.g4#L257-L260

MessageValue can be delimited by { ... } or by < ... >, but the Antlr grammar does not accept that.
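To make the expected behavior concrete, here is a minimal sketch of a recognizer for a simplified subset of the EBNF quoted above. It is not the Antlr grammar and not what protoc does; the names (`tokenize`, `Parser`, `parse_text_format`) are hypothetical, lists are left out, and scalars are reduced to identifiers, decimal integers, and double-quoted strings. It only illustrates that both `{ ... }` and `< ... >` should be accepted for MessageValue, and that the ":" is optional before a message but required before a scalar.

```python
import re

# Simplified tokens (an assumption, not the full spec): identifiers,
# decimal integers, double-quoted strings, and single-character punctuation.
TOKEN = re.compile(
    r'\s*(?:([A-Za-z_][A-Za-z0-9_]*)|(-?\d+)|("(?:[^"\\]|\\.)*")|([{}<>:;,]))'
)
PUNCT = set("{}<>:;,")
CLOSERS = {"{": "}", "<": ">"}  # MessageValue = "{" Message "}" | "<" Message ">"


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(m.lastindex))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def message(self, closer=None):
        # Message = { Field }; stops at the closing delimiter or end of input.
        fields = {}
        while self.peek() is not None and self.peek() != closer:
            name, value = self.field()
            fields[name] = value
        return fields

    def field(self):
        name = self.take()
        if not re.fullmatch(r"[A-Za-z_]\w*", name):
            raise SyntaxError(f"expected field name, got {name!r}")
        has_colon = self.peek() == ":"
        if has_colon:
            self.take(":")
        if self.peek() in CLOSERS:
            value = self.message_value()  # ":" is optional before a message
        elif has_colon:
            value = self.scalar()
        else:
            raise SyntaxError(f"scalar field {name!r} requires ':'")
        if self.peek() in (";", ","):  # optional trailing separator
            self.take()
        return name, value

    def message_value(self):
        opener = self.take()
        closer = CLOSERS[opener]
        msg = self.message(closer)
        self.take(closer)  # the closer must match the opener
        return msg

    def scalar(self):
        tok = self.take()
        if tok in PUNCT:
            raise SyntaxError(f"expected scalar value, got {tok!r}")
        if re.fullmatch(r"-?\d+", tok):
            return int(tok)
        if tok.startswith('"'):
            return tok[1:-1]  # escapes are not decoded in this sketch
        return tok  # identifier, e.g. an enum value


def parse_text_format(text):
    return Parser(tokenize(text)).message()
```

With this sketch, `a { b: 1 }` and `a < b: 1 >` parse to the same nested structure, while a mismatched pair such as `a { b: 1 >` is rejected.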

NB: I will have to check what protoc does before changing the grammar.

kaby76 Sep 26 '25 12:09

I'm thinking that this should be implemented via a shared .g4 grammar, which is import-ed. As far as I can tell, the Text Format Language Specification is not versioned.

kaby76 Sep 27 '25 12:09