protobuf protoc should not populate `json_name` in descriptor if no `json

What version of protobuf and what language are you using? Version: v3.6.0 Language: any

What operating system (Linux, Windows, ...) and version? n/a

What runtime / compiler are you using (e.g., python version or gcc version) n/a

What did you do? Steps to reproduce the behavior:

test.proto:

syntax = "proto3";

package testing;

message Animal {
  string name = 1;
  bool has_hooves = 2;
  int32 birth_year = 3 [json_name = "year_of_birth"];
}

Command:

protoc -o /dev/stdout test.proto | protoc --decode=google.protobuf.FileDescriptorSet descriptor.proto

Output:

file {
  name: "test.proto"
  package: "testing"
  message_type {
    name: "Animal"
    field {
      name: "name"
      number: 1
      label: LABEL_OPTIONAL
      type: TYPE_STRING
      json_name: "name"
    }
    field {
      name: "has_hooves"
      number: 2
      label: LABEL_OPTIONAL
      type: TYPE_BOOL
      json_name: "hasHooves"
    }
    field {
      name: "birth_year"
      number: 3
      label: LABEL_OPTIONAL
      type: TYPE_INT32
      json_name: "year_of_birth"
    }
  }
  syntax: "proto3"
}

What did you expect to see It should be possible to tell whether json_name was specified in the .proto file. In particular, there should be a way to distinguish that json_name was specified for birth_year, and not specified for has_hooves.

What did you see instead? It is impossible to tell whether json_name was specified in the .proto file, because protoc invents a json_name for each field, regardless.

Anything else

This makes it impossible to create a json serializer that prefers the vanilla protobuf field names (with underscores, by convention), but lets individual fields be overridden. Instead, the default is the weird java-ish mangled version of field names. I'm fine with that being the default, but other options should be possible. For instance, in Go's jsonpb, there is an OrigName option. But it's all-or-nothing. If you opt out of the java-ish names, you also opt out of being able to override a name.
This bloats the serialized descriptor. The language runtimes can manufacture default json field names just fine.
This makes it less possible to recreate something close to the original .proto file from the descriptor.

Jan 16 '19 16:01 zellyn

The one thing this issue fails to address is how this works in a language neutral way. If two language runtimes fail to make the same default json field name because of a quirk with one runtime then they can't work together. Sure that might be a bug with the quirked runtime, but the whole issue could just be avoided if the runtime was given the same json name as the other runtimes via protoc.

On top of that, calculating a json name takes time. It may not be a long time, but it is time and memory. It's not time and memory to calculate it ahead of time and reference it as a constant value (depending on the runtime).

The last point doesn't seem like much of an issue. You should be distributing .proto files to anyone that needs them. The only people I know that can't get .proto files distributed to them are people writing Steam bots and Pokemon Go bots. Feel free to list any legitimate reasons people should be able to more easily reverse engineer a .proto file from a descriptor.

Jan 17 '19 07:01 ObsidianMinor

The biggest complaint is that it makes it impossible to tell whether a user specified a json_name or not. I suppose it would be possible to add an additional bool explicit_json_name field or something.

We have a lot of code still on protobuf 3.0.0, and when we try to upgrade to a newer protobuf release, all the json field names change. We can use the OrigName option in jsonpb, but that will completely disallow any fieldname customization. What we really need is a PreferOrigName option that still allows explicit field renaming. But the fact that protoc is synthesizing stuff into the descriptor precludes the possibility of implementing such an option, either in upstream golang/protobuf or even in our own custom serialization code.

As far as the time burden of calculating json names… in compiled runtimes that often happens at generation time (it does in Go, currently), and even in interpreted languages, it's surely highly cacheable. Besides, they all have to have the code/machinery to be able to handle descriptors that lack json_name anyway, so it's already there.

It's true that recreating a .proto file from the descriptor is uncommon (although they do/can include all kinds of comment and line number information in the SourceCodeInfo message, which seems built for that purpose); I guess I was trying to articulate the fact that it “feels wrong” to introduce extra synthetic information into the descriptor.

Jan 17 '19 13:01 zellyn

https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/compiler/command_line_interface.cc#L1972 I think we can have a command line option to decide whether to include json name in the output. @acozzette how do you think?

Jan 22 '19 21:01 TeBoring

Yeah, I think a commandline option is perfect. It should always reflect the json_name if it was specified in the .proto file, but the commandline option could turn off the synthesis of googleCase defaults.

Jan 22 '19 21:01 zellyn

Any updates on this? Would sending a PR help?

May 21 '19 19:05 zellyn

Sorry this slipped off my radar. Usually we try to avoid new options but maybe it is justified in this case. I'm guessing if we just removed json_name unconditionally then things might break. @haon4 What do you think?

May 22 '19 20:05 acozzette

I don't think we should change the current behavior as users are already relying on it. I am also not a fan of adding more command line option, as this doesn't seem like a common feature that other people are asking for, and I really don't like to add more switch for one use-case.

Can this problem be solved at the application level rather than at the protobuf level? Does it really matter if the json_name was explicitly specified or not? If you make the lowerCamelCase transformation on the field name and that matches the json_name option, can you assume that the json_name is not specified? Even if the json_name is explicitly specified in that case, does it matter?

May 22 '19 23:05 ghost

I thought I described why earlier, but no, it is not solvable at an application level, because the way protoc currently populates json_name for all fields makes it impossible to tell whether any given field actually had a json_name property.

A commandline flag would be perfect.

May 23 '19 13:05 zellyn

btw, if y'all don't add a commandline option, we will be forced to fork protoc internally, and add one 😞 Either that or specify a json_name on every single existing field that includes underscores (also 😞). I'm happy to rough together a PR adding a commandline option, if that would help 🙂.

May 23 '19 14:05 zellyn

Fair enough. If you can put together a PR for this command line option, it may help. Point taken that this can reduce the size bloat on the serialized descriptor, and most languages should already handle the fallback in case json_name is not there, so I think that may be something we would want to enable by default in the long term. I see there are still many places that are making the assumption about json_name being there though, but having the command line option may enable us to slowly knock them out.

May 23 '19 18:05 ghost

I think removing json_name from returned FileDescriptorSets is unfortunately a breaking change at this point - external plugins could totally be relying on that to remove the need to compute json_name themselves. @jhump

Another option would be to add a field to google.protobuf.FieldDescriptorProto that is something like i.e. optional bool json_name_explicitly_set = 11; that protoc could populate, and plugins could make decisions based on this. If this was done, the builtin plugins should handle this field as well re: #6175. This seems dirty, but may actually be cleaner than a protoc command-line flag, and it is kind of how Protobuf is intended to be used (you can't change the json_name field, but you can iterate on your message definition by adding new fields).

May 24 '19 13:05 bufdev

Yea, I think having something like optional bool json_name_explicitly_set = 11; would be quite nice, actually. I would be happy with that.

May 24 '19 17:05 ghost

On second thought, I think it has to be the negative for backwards compatibility reasons, unless I'm confusing myself, ie optional bool json_name_NOT_explicitly_set = 11;. If you don't have this, then you can't tell the difference between explicitly_set == false being because of using an old version of protoc, or because it was actually not set. I might be confusing myself though in double-negatives.

The work as I see it:

Decide on a name for this field.
Add this field to descriptor.proto, preferably before the 3.8.0 release but I doubt that's possible at this point.
Populate this field when writing FileDescriptorSets.
Pass the same FileDescriptorSet to the builtin plugins when compiling Protobuf files directory as is passed when running --descriptor_set_in (the former does not send a set json_name to the builtin plugins)
Have each builtin plugin respect this field with regards to generation, ie do not print comments in the Java files if not explicitly set, do not set in the Python options, etc.

Note that for backwards compatibility, when running a plugin (both the builtins and external plugins), you need to do the following with this (in pseudo-code):

if get(jsonNameField) != "" {
  // if this is false, it could be due to not being set at all with older versions of protoc,
  // in which case we have to act as if the jsonName was explicitly set
  if get(jsonNameNotExplicitlySetField) { 
    actAsIfNotSet()
  } else {
    actAsIfSet()
  }
}

May 24 '19 19:05 bufdev

I think it has to be the negative for backwards compatibility reasons

Remember that descriptor.proto is proto2 syntax. So it could just use a default value:

optional bool json_name_configured = 11 [default=true];

May 25 '19 00:05 jhump

Good point. One issue though is that makes the definition a tad weird - even if no FileDescriptorSet provider (ie protoc et al) would produce this, what does it mean when json_name == “” && json_name_configured == true?

May 25 '19 00:05 bufdev

string foo = 1 [json_name=""];

:P

(Related: https://github.com/protocolbuffers/protobuf/issues/5682)

May 25 '19 00:05 jhump

Lol

May 25 '19 00:05 bufdev

I'm fine with json_name_explicitly_set with a default of false. If you're in this situation, you can probably just start using a new version of protoc. Anything I can help do to push this forward?

Jun 05 '19 15:06 zellyn

I think it needs to be json_name_not_explicitly_set, this does matter for being able to really reason about the value of the field, see above

Jun 05 '19 15:06 bufdev

@elharo Do you know if someone is still working on this? Can we expect this config to be part of future releases?

I am willing to contribute if this is not being worked on.

Jan 28 '22 06:01 blacktooth

I don't think anyone is working on this. However, reading through this now I confess I don't see a use case for this feature. That is, why does this matter? Until I can achieve clarity on that I don't think I'd be likely to approve any such PR. Perhaps someone else sees something here I don't.

My thinking (not dispositive) is as follows:

Every field has a json_name which is fully determinable by reading the .proto file.
When protoc emits a descriptor as JSON, it includes the json_name.
The json_name in the .proto can be calculated or explicitly specified, but it doesn't matter which nor should any processing depend on that syntax detail.

I'm not sure about the "fully" in step 1. There are command line options that change this, but that doesn't change my conclusion. This feature simply doesn't seem to be needed, likely encourages bad code and non-interoperable protos, and is unlikely to be worth the added complexity in code.

Metaphorically it's like trying to change the byte code emitted by javac so that we can tell whether it was indented with spaces or tabs.

Feb 01 '22 20:02 elharo

Metaphorically it's like trying to change the byte code emitted by javac so that we can tell whether it was indented with spaces or tabs.

@elharo, I think the fact that an artifact of this is that protoc itself produces different code based on whether the input is the original source vs. a file descriptor set means this is also a consistency issue. And that's a problem because it means that strategies used by build tools could result in different (non-deterministic) output. That case is described more thoroughly in #6175 (which was closed as a duplicate of this issue).

Feb 02 '22 15:02 jhump

We have built parsers that can read a file descriptor set and re-create the original .proto schema. This feature enables users to auto-discover schemas during stream processing and register them in the schema registry. Since the file descriptor set generation is different for the same .proto schema, it results in different schemas getting generated. This results in inconsistencies and non-deterministic behavior as called out by @jhump.

The Protobuf code base also acknowledges it as a bug and a work around was introduced to deal with it.

Since this change will be non-backward compatible, I recommend that we add the boolean flag json_name_explicitly_set (or a better name) as suggested previously.

Reference

Generating proto schema from FileDescriptor

Feb 03 '22 19:02 blacktooth

@blacktooth "Since the file descriptor set generation is different for the same .proto schema, it results in different schemas getting generated." What exactly do you mean by different here? Are some fields present in one and not the other? Are the fields renumbered? Are the fields reordered? Is the white space changed? Are comments added or removed?

I don't expect the results would be byte-per-byte identical, and some of these changes are more serious than others.

May 12 '22 11:05 elharo

For future reference, relevant code seems to be in protobuf/compiler/command_line_interface.cc void CommandLineInterface::GetTransitiveDependencies

May 12 '22 12:05 elharo

Any progress on this?

Jul 13 '23 13:07 rdesgroppes

I don't think anyone is working on this right now

Jul 13 '23 14:07 fowles

Sad. #6175 was closed as a dup of this one, but summarizes the problem pretty well:

protoc: code gen is different between using source files and using descriptor set

Jul 13 '23 14:07 rdesgroppes

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

May 05 '24 10:05 github-actions[bot]

/keepopen

May 05 '24 10:05 rdesgroppes

protobuf
protobuf copied to clipboard

protoc should not populate `json_name` in descriptor if no `json_name` was specified

protobuf protobuf copied to clipboard

protoc should not populate `json_name` in descriptor if no `json_name` was specified

protobuf
protobuf copied to clipboard