pq
pq copied to clipboard
Canonical JSON mapping
According to this page, Proto3 defines a canonical JSON mapping. It would be awesome if pq
had an option to be compliant.
Currently, the single difference I'm aware of is the name of fields, that should be converted to lowerCamelCase. Handling this would be a great step further, but maybe there are other subtle issues to address in order to be 100% spec-compliant.
Thanks for pointing this out. Do I basically go to every key in the JSON output and just do some string manipulation to convert them to lowerCamelCase? e.g. python pseudocode: 'snake_case'.split('_')[1] + ''.join([x.capitalize() for x in 'snake_case'.split('_')[1:]])
Yes, sounds good to me ! :+1:
Here's my attempt: https://github.com/sevagh/pq/pull/75/files#diff-d3074ba1689c29fa340ca2ef02837dfaL37
There's a lot of rules but I made some good progress in making a section where we can define rules easily.
I ran this in prod and it works well. Not sure if I want it to be default or toggleable with a flag. --compliant
?
Wow, that was fast, thanks for your efforts! :+1:
I have tested v1.0.1. It seems to work very well, but only at the root level, not for sub-objects.
Regarding the default behavior, my guess is that a toggleable flag is perfect to remain compatible while in v1.x (--canonical
sounds better than --compliant
IMHO).
However, it may also make sense to define the canonical mapping as default for the next major release, while still keeping a --legacy
flag for current users.
I like the idea to switch behaviors and add a legacy flag in the next major release.
For now I need to figure out a way to descend into the Serde::Value object to find every possible key to modify.
I have tested v1.0.1. It seems to work very well, but only at the root level, not for sub-objects.
Actually maybe I'm misunderstanding the document but it seems to me that sub-objects are considered as map{"k": v}
and thus don't need to be modified?
Message field names are mapped to lowerCamelCase and become JSON object keys.
Do message field names only apply to the root keys or even sub-objects? Are sub-objects messages or maps?
map<K,V> | object | {"k": v, …} | All keys are converted to strings.
From what I understand, map<K,V>
is for free-style key-value maps, where the keys do not actually have to be converted.
The issue I encounter is when nesting message
s like this:
syntax = "proto3";
message Parent {
Child my_child = 1;
}
message Child {
string foo_bar = 1;
}
If I decode 0x0A050A0362617A
with --canonical
, I get this:
{
"myChild": {
"foo_bar": "baz"
}
}
Instead of this:
{
"myChild": {
"fooBar": "baz"
}
}
Thanks for the test case. I'll try some things.
I think you're right about maps, this is what a real map looks like:
"myChild": [
{
"key": "foo_bar",
"value": "baz"
}
]
I created this proto to test it:
syntax = "proto3";
message ParentMap {
map<string, string> my_child = 1;
}
So I think if I recursively check all map keys (where key != "key"), I can safely apply the transformation. The underlying Rust object that I traverse looks like this:
Nested proto case:
Map(
{
String(
"my_child"
): Option(
Some(
Map(
{
String(
"foo_bar"
): Option(
Some(
String(
"baz"
)
)
)
}
)
)
)
}
)
Map case:
Map(
{
String(
"my_child"
): Seq(
[
Map(
{
String(
"key"
): Option(
Some(
String(
"foo_bar"
)
)
),
String(
"value"
): Option(
Some(
String(
"baz"
)
)
)
}
)
]
)
}
)
Tentative fix:
sevagh:python $ ./proto_encoder.py single nested | ../../target/debug/pq --fdsetfile ..
/fdsets/parent_child_nested.fdset --msgtype Parent --canonical
{
"myChild": {
"fooBar": "baz"
}
}
https://github.com/sevagh/pq/pull/76
Wanna test this hotfix? pq-bin.tar.gz
It uses rust nightly to recursively peak inside the JSON and modify all keys. 1.0.2
Well done, it works! :tada: I'm not familiar with Rust, what are the implications of using a nightly? Is it stable enough to release this hotfix?
To be spec-compliant, I think that map<K,V>
should be handled differently. With your ParentMap
message, I would expect 0x0A0E0A07666F6F5F626172120362617A
to be decoded like this (keys are only converted to strings, no case transformation):
{
"myChild": {
"foo_bar": "baz"
}
}
PS: if I understand correctly, what your are currently doing is a post-processing of the deserialized message. So, it's probably difficult to reliably distinguish a map<K,V>
from a message
. Maybe you can find inspirations in the official Java implementation.
Nightly should have no serious implications, a lot of real production Rust code uses nightly.
PS: if I understand correctly, what your are currently doing is a post-processing of the deserialized message.
Yes, so I think handling your map<K, V>
case is something that should be modified in my upstream protobuf dependencies (either rust-protobuf or serde-protobuf).
In fact maybe even --canonical
's post-processing behavior belongs in serde-protobuf.
I totally agree, this should ideally be handled earlier in the deserialization process. However, can we expect these libraries to implement this in the short term?
sevagh:pq $ ./tests/python/proto_encoder.py single nested | ./target/debug/pq --fdsetfile ./tests/fdsets/parent_child_nested.fdset --msgtype Parent --canonical
{
"myChild": {
"fooBar": "baz"
}
}
sevagh:pq $ ./tests/python/proto_encoder.py single map | ./target/debug/pq --fdsetfile ./tests/fdsets/parent_child_map.fdset --msgtype ParentMap --canonical
{
"myChild": {
"foo_bar": "baz"
}
}
Some more thoughts on the upstream/canonical JSON handling question:
- rust-protobuf provides the lowest level of protobuf code. It doesn't care about JSON
- serde-protobuf deserializes protobuf into
serde_value
objects. It doesn't care about JSON. Google doesn't have any rules on what language-specific intermediate formats should look like. - It is my personal choice in
pq
to feedserde-protobuf
's outputserde_value
objects intoserde_json
to have JSON as a final output
Therefore, it should be my responsibility to massage the deserialized input to output canonical JSON. pq promises protobuf
-> json
. My dependencies don't.
Well, it makes sense, thanks for these explanations.
Your latest changes look great! Are you still able to distinguish these two cases?
syntax = "proto3";
message ParentMap {
map<string, string> my_child = 1;
}
syntax = "proto3";
message KeyValue {
string key = 1;
string value = 2;
}
message ParentMap {
repeated KeyValue my_child = 1;
}
The first should be converted to {"k": v, …}
, but not the second.
I'll whip up a test case. I didn't notice the "RepeatedField" which defeats my Seq
scan expectation.
It fails on this footgun. At this point distinguishing a RepeatedField from a real map is impossible. That one needs to be upstream.
Interestingly, look at what I find when I peak into the FileDescriptorSet
object:
fdset: file {
name: "schemata/parent_child_nested.proto"
message_type {
name: "Parent"
field {
name: "my_child"
number: 1
label: LABEL_OPTIONAL
type: TYPE_MESSAGE
type_name: ".Child"
json_name: "myChild"
}
}
message_type {
name: "Child"
field {
name: "foo_bar"
number: 1
label: LABEL_OPTIONAL
type: TYPE_STRING
json_name: "fooBar"
}
}
syntax: "proto3"
fdset: file {
name: "schemata/parent_child_map.proto"
message_type {
name: "ParentMap"
field {
name: "my_child"
number: 1
label: LABEL_REPEATED
type: TYPE_MESSAGE
type_name: ".ParentMap.MyChildEntry"
json_name: "myChild"
}
nested_type {
name: "MyChildEntry"
field {
name: "key"
number: 1
label: LABEL_OPTIONAL
type: TYPE_STRING
json_name: "key"
}
field {
name: "value"
number: 2
label: LABEL_OPTIONAL
type: TYPE_STRING
json_name: "value"
}
options {
map_entry: true
}
}
}
message_type {
name: "KeyValue"
field {
name: "key"
number: 1
label: LABEL_OPTIONAL
type: TYPE_STRING
json_name: "key"
}
field {
name: "value"
number: 2
label: LABEL_OPTIONAL
type: TYPE_STRING
json_name: "value"
}
}
message_type {
name: "ParentMapFootgun"
field {
name: "my_child"
number: 1
label: LABEL_REPEATED
type: TYPE_MESSAGE
type_name: ".KeyValue"
json_name: "myChild"
}
}
syntax: "proto3"
}
There is a hint in there, json_name
. I should find a way to use this instead of black magic.
This code will take some time to think about and implement correctly. I'll be MIA for the next month so it will take time!
Well spoted!
I agree that using the json_name
is probably better than black magic.
However, I'm not sure it's a good idea to parse and apply the fdset at multiple places.
Don't worry, no urgence at all :wink:
Thanks for contributing BTW. Hope we can find a solution to this eventually.
Alright, @connesc - I'm back and ready to work on this again but I'm now realizing I don't like the solution I implemented. I liked pq the way it was - just chaining some command-line options and existing protobuf/json parsers.
I intend to clean up and remove the half-solutions I implemented. In the meanwhile, I'll see if I need to go upstream with this JSON canonicalization problem. Even better, I've been intending to write my own protobuf-to-json piece. Maybe now's the time to start on it.