protobuf
protobuf copied to clipboard
protoc should not populate `json_name` in descriptor if no `json_name` was specified
What version of protobuf and what language are you using? Version: v3.6.0 Language: any
What operating system (Linux, Windows, ...) and version? n/a
What runtime / compiler are you using (e.g., python version or gcc version) n/a
What did you do? Steps to reproduce the behavior:
test.proto:
syntax = "proto3";
package testing;
message Animal {
string name = 1;
bool has_hooves = 2;
int32 birth_year = 3 [json_name = "year_of_birth"];
}
Command:
protoc -o /dev/stdout test.proto | protoc --decode=google.protobuf.FileDescriptorSet descriptor.proto
Output:
file {
name: "test.proto"
package: "testing"
message_type {
name: "Animal"
field {
name: "name"
number: 1
label: LABEL_OPTIONAL
type: TYPE_STRING
json_name: "name"
}
field {
name: "has_hooves"
number: 2
label: LABEL_OPTIONAL
type: TYPE_BOOL
json_name: "hasHooves"
}
field {
name: "birth_year"
number: 3
label: LABEL_OPTIONAL
type: TYPE_INT32
json_name: "year_of_birth"
}
}
syntax: "proto3"
}
What did you expect to see
It should be possible to tell whether json_name
was specified in the .proto
file. In particular, there should be a way to distinguish that json_name
was specified for birth_year
, and not specified for has_hooves
.
What did you see instead?
It is impossible to tell whether json_name
was specified in the .proto
file, because protoc
invents a json_name
for each field, regardless.
Anything else
- This makes it impossible to create a json serializer that prefers the vanilla protobuf field names (with underscores, by convention), but lets individual fields be overridden. Instead, the default is the weird java-ish mangled version of field names. I'm fine with that being the default, but other options should be possible. For instance, in Go's
jsonpb
, there is anOrigName
option. But it's all-or-nothing. If you opt out of the java-ish names, you also opt out of being able to override a name. - This bloats the serialized descriptor. The language runtimes can manufacture default json field names just fine.
- This makes it less possible to recreate something close to the original
.proto
file from the descriptor.
The one thing this issue fails to address is how this works in a language neutral way. If two language runtimes fail to make the same default json field name because of a quirk with one runtime then they can't work together. Sure that might be a bug with the quirked runtime, but the whole issue could just be avoided if the runtime was given the same json name as the other runtimes via protoc.
On top of that, calculating a json name takes time. It may not be a long time, but it is time and memory. It's not time and memory to calculate it ahead of time and reference it as a constant value (depending on the runtime).
The last point doesn't seem like much of an issue. You should be distributing .proto
files to anyone that needs them. The only people I know that can't get .proto
files distributed to them are people writing Steam bots and Pokemon Go bots. Feel free to list any legitimate reasons people should be able to more easily reverse engineer a .proto
file from a descriptor.
The biggest complaint is that it makes it impossible to tell whether a user specified a json_name
or not. I suppose it would be possible to add an additional bool explicit_json_name
field or something.
We have a lot of code still on protobuf 3.0.0, and when we try to upgrade to a newer protobuf release, all the json field names change. We can use the OrigName
option in jsonpb
, but that will completely disallow any fieldname customization. What we really need is a PreferOrigName
option that still allows explicit field renaming. But the fact that protoc
is synthesizing stuff into the descriptor precludes the possibility of implementing such an option, either in upstream golang/protobuf or even in our own custom serialization code.
As far as the time burden of calculating json names… in compiled runtimes that often happens at generation time (it does in Go, currently), and even in interpreted languages, it's surely highly cacheable. Besides, they all have to have the code/machinery to be able to handle descriptors that lack json_name
anyway, so it's already there.
It's true that recreating a .proto
file from the descriptor is uncommon (although they do/can include all kinds of comment and line number information in the SourceCodeInfo
message, which seems built for that purpose); I guess I was trying to articulate the fact that it “feels wrong” to introduce extra synthetic information into the descriptor.
https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/compiler/command_line_interface.cc#L1972 I think we can have a command line option to decide whether to include json name in the output. @acozzette how do you think?
Yeah, I think a commandline option is perfect. It should always reflect the json_name
if it was specified in the .proto
file, but the commandline option could turn off the synthesis of googleCase defaults.
Any updates on this? Would sending a PR help?
Sorry this slipped off my radar. Usually we try to avoid new options but maybe it is justified in this case. I'm guessing if we just removed json_name
unconditionally then things might break. @haon4 What do you think?
I don't think we should change the current behavior as users are already relying on it. I am also not a fan of adding more command line option, as this doesn't seem like a common feature that other people are asking for, and I really don't like to add more switch for one use-case.
Can this problem be solved at the application level rather than at the protobuf level? Does it really matter if the json_name was explicitly specified or not? If you make the lowerCamelCase transformation on the field name and that matches the json_name option, can you assume that the json_name is not specified? Even if the json_name is explicitly specified in that case, does it matter?
I thought I described why earlier, but no, it is not solvable at an application level, because the way protoc currently populates json_name
for all fields makes it impossible to tell whether any given field actually had a json_name
property.
A commandline flag would be perfect.
btw, if y'all don't add a commandline option, we will be forced to fork protoc internally, and add one 😞 Either that or specify a json_name
on every single existing field that includes underscores (also 😞). I'm happy to rough together a PR adding a commandline option, if that would help 🙂.
Fair enough. If you can put together a PR for this command line option, it may help. Point taken that this can reduce the size bloat on the serialized descriptor, and most languages should already handle the fallback in case json_name
is not there, so I think that may be something we would want to enable by default in the long term. I see there are still many places that are making the assumption about json_name
being there though, but having the command line option may enable us to slowly knock them out.
I think removing json_name
from returned FileDescriptorSets
is unfortunately a breaking change at this point - external plugins could totally be relying on that to remove the need to compute json_name
themselves. @jhump
Another option would be to add a field to google.protobuf.FieldDescriptorProto
that is something like i.e. optional bool json_name_explicitly_set = 11;
that protoc
could populate, and plugins could make decisions based on this. If this was done, the builtin plugins should handle this field as well re: #6175. This seems dirty, but may actually be cleaner than a protoc
command-line flag, and it is kind of how Protobuf is intended to be used (you can't change the json_name
field, but you can iterate on your message definition by adding new fields).
Yea, I think having something like optional bool json_name_explicitly_set = 11;
would be quite nice, actually. I would be happy with that.
On second thought, I think it has to be the negative for backwards compatibility reasons, unless I'm confusing myself, ie optional bool json_name_NOT_explicitly_set = 11;
. If you don't have this, then you can't tell the difference between explicitly_set == false
being because of using an old version of protoc
, or because it was actually not set. I might be confusing myself though in double-negatives.
The work as I see it:
- Decide on a name for this field.
- Add this field to
descriptor.proto
, preferably before the 3.8.0 release but I doubt that's possible at this point. - Populate this field when writing
FileDescriptorSets
. - Pass the same
FileDescriptorSet
to the builtin plugins when compiling Protobuf files directory as is passed when running--descriptor_set_in
(the former does not send a setjson_name
to the builtin plugins) - Have each builtin plugin respect this field with regards to generation, ie do not print comments in the Java files if not explicitly set, do not set in the Python options, etc.
Note that for backwards compatibility, when running a plugin (both the builtins and external plugins), you need to do the following with this (in pseudo-code):
if get(jsonNameField) != "" {
// if this is false, it could be due to not being set at all with older versions of protoc,
// in which case we have to act as if the jsonName was explicitly set
if get(jsonNameNotExplicitlySetField) {
actAsIfNotSet()
} else {
actAsIfSet()
}
}
I think it has to be the negative for backwards compatibility reasons
Remember that descriptor.proto
is proto2 syntax. So it could just use a default value:
optional bool json_name_configured = 11 [default=true];
Good point. One issue though is that makes the definition a tad weird - even if no FileDescriptorSet provider (ie protoc et al) would produce this, what does it mean when json_name == “” && json_name_configured == true?
string foo = 1 [json_name=""];
:P
(Related: https://github.com/protocolbuffers/protobuf/issues/5682)
Lol
I'm fine with json_name_explicitly_set
with a default of false. If you're in this situation, you can probably just start using a new version of protoc
. Anything I can help do to push this forward?
I think it needs to be json_name_not_explicitly_set, this does matter for being able to really reason about the value of the field, see above
@elharo Do you know if someone is still working on this? Can we expect this config to be part of future releases?
I am willing to contribute if this is not being worked on.
I don't think anyone is working on this. However, reading through this now I confess I don't see a use case for this feature. That is, why does this matter? Until I can achieve clarity on that I don't think I'd be likely to approve any such PR. Perhaps someone else sees something here I don't.
My thinking (not dispositive) is as follows:
- Every field has a json_name which is fully determinable by reading the .proto file.
- When protoc emits a descriptor as JSON, it includes the json_name.
- The json_name in the .proto can be calculated or explicitly specified, but it doesn't matter which nor should any processing depend on that syntax detail.
I'm not sure about the "fully" in step 1. There are command line options that change this, but that doesn't change my conclusion. This feature simply doesn't seem to be needed, likely encourages bad code and non-interoperable protos, and is unlikely to be worth the added complexity in code.
Metaphorically it's like trying to change the byte code emitted by javac so that we can tell whether it was indented with spaces or tabs.
Metaphorically it's like trying to change the byte code emitted by javac so that we can tell whether it was indented with spaces or tabs.
@elharo, I think the fact that an artifact of this is that protoc
itself produces different code based on whether the input is the original source vs. a file descriptor set means this is also a consistency issue. And that's a problem because it means that strategies used by build tools could result in different (non-deterministic) output. That case is described more thoroughly in #6175 (which was closed as a duplicate of this issue).
We have built parsers that can read a file descriptor set and re-create the original .proto
schema. This feature enables users to auto-discover schemas during stream processing and register them in the schema registry. Since the file descriptor set generation is different for the same .proto schema, it results in different schemas getting generated. This results in inconsistencies and non-deterministic behavior as called out by @jhump.
The Protobuf code base also acknowledges it as a bug and a work around was introduced to deal with it.
Since this change will be non-backward compatible, I recommend that we add the boolean flag json_name_explicitly_set
(or a better name) as suggested previously.
Reference
@blacktooth "Since the file descriptor set generation is different for the same .proto schema, it results in different schemas getting generated." What exactly do you mean by different here? Are some fields present in one and not the other? Are the fields renumbered? Are the fields reordered? Is the white space changed? Are comments added or removed?
I don't expect the results would be byte-per-byte identical, and some of these changes are more serious than others.
For future reference, relevant code seems to be in protobuf/compiler/command_line_interface.cc void CommandLineInterface::GetTransitiveDependencies
Any progress on this?
I don't think anyone is working on this right now
Sad. #6175 was closed as a dup of this one, but summarizes the problem pretty well:
protoc: code gen is different between using source files and using descriptor set
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.
This issue is labeled inactive
because the last activity was over 90 days ago.
/keepopen