pq icon indicating copy to clipboard operation
pq copied to clipboard

Canonical JSON mapping

Open connesc opened this issue 7 years ago • 26 comments

According to this page, Proto3 defines a canonical JSON mapping. It would be awesome if pq had an option to be compliant.

Currently, the single difference I'm aware of is the name of fields, that should be converted to lowerCamelCase. Handling this would be a great step further, but maybe there are other subtle issues to address in order to be 100% spec-compliant.

connesc avatar Nov 17 '17 14:11 connesc

Thanks for pointing this out. Do I basically go to every key in the JSON output and just do some string manipulation to convert them to lowerCamelCase? e.g. python pseudocode: 'snake_case'.split('_')[1] + ''.join([x.capitalize() for x in 'snake_case'.split('_')[1:]])

sevagh avatar Nov 17 '17 16:11 sevagh

Yes, sounds good to me ! :+1:

connesc avatar Nov 17 '17 17:11 connesc

Here's my attempt: https://github.com/sevagh/pq/pull/75/files#diff-d3074ba1689c29fa340ca2ef02837dfaL37

There's a lot of rules but I made some good progress in making a section where we can define rules easily.

sevagh avatar Nov 17 '17 19:11 sevagh

I ran this in prod and it works well. Not sure if I want it to be default or toggleable with a flag. --compliant?

sevagh avatar Nov 17 '17 20:11 sevagh

Wow, that was fast, thanks for your efforts! :+1:

I have tested v1.0.1. It seems to work very well, but only at the root level, not for sub-objects.

Regarding the default behavior, my guess is that a toggleable flag is perfect to remain compatible while in v1.x (--canonical sounds better than --compliant IMHO). However, it may also make sense to define the canonical mapping as default for the next major release, while still keeping a --legacy flag for current users.

connesc avatar Nov 20 '17 10:11 connesc

I like the idea to switch behaviors and add a legacy flag in the next major release.

For now I need to figure out a way to descend into the Serde::Value object to find every possible key to modify.

sevagh avatar Nov 20 '17 14:11 sevagh

I have tested v1.0.1. It seems to work very well, but only at the root level, not for sub-objects.

Actually maybe I'm misunderstanding the document but it seems to me that sub-objects are considered as map{"k": v} and thus don't need to be modified?

Message field names are mapped to lowerCamelCase and become JSON object keys.

Do message field names only apply to the root keys or even sub-objects? Are sub-objects messages or maps?

map<K,V> | object | {"k": v, …} | All keys are converted to strings.

sevagh avatar Nov 20 '17 14:11 sevagh

From what I understand, map<K,V> is for free-style key-value maps, where the keys do not actually have to be converted. The issue I encounter is when nesting messages like this:

syntax = "proto3";

message Parent {
        Child my_child = 1;
}

message Child {
        string foo_bar = 1;
}

If I decode 0x0A050A0362617A with --canonical, I get this:

{
  "myChild": {
    "foo_bar": "baz"
  }
}

Instead of this:

{
  "myChild": {
    "fooBar": "baz"
  }
}

connesc avatar Nov 20 '17 17:11 connesc

Thanks for the test case. I'll try some things.

sevagh avatar Nov 20 '17 18:11 sevagh

I think you're right about maps, this is what a real map looks like:

"myChild": [
    {
      "key": "foo_bar",
      "value": "baz"
    }
  ]

I created this proto to test it:

syntax = "proto3";

message ParentMap {
        map<string, string> my_child = 1;
}

So I think if I recursively check all map keys (where key != "key"), I can safely apply the transformation. The underlying Rust object that I traverse looks like this:

Nested proto case:

Map(
    {
        String(
            "my_child"
        ): Option(
            Some(
                Map(
                    {
                        String(
                            "foo_bar"
                        ): Option(
                            Some(
                                String(
                                    "baz"
                                )
                            )
                        )
                    }
                )
            )
        )
    }
)

Map case:

Map(
    {
        String(
            "my_child"
        ): Seq(
            [
                Map(
                    {
                        String(
                            "key"
                        ): Option(
                            Some(
                                String(
                                    "foo_bar"
                                )
                            )
                        ),
                        String(
                            "value"
                        ): Option(
                            Some(
                                String(
                                    "baz"
                                )
                            )
                        )
                    }
                )
            ]
        )
    }
)

sevagh avatar Nov 20 '17 18:11 sevagh

Tentative fix:

sevagh:python $ ./proto_encoder.py single nested | ../../target/debug/pq --fdsetfile ..
/fdsets/parent_child_nested.fdset --msgtype Parent --canonical
{
  "myChild": {
    "fooBar": "baz"
  }
}

sevagh avatar Nov 20 '17 19:11 sevagh

https://github.com/sevagh/pq/pull/76

sevagh avatar Nov 20 '17 19:11 sevagh

Wanna test this hotfix? pq-bin.tar.gz

It uses rust nightly to recursively peak inside the JSON and modify all keys. 1.0.2

sevagh avatar Nov 20 '17 19:11 sevagh

Well done, it works! :tada: I'm not familiar with Rust, what are the implications of using a nightly? Is it stable enough to release this hotfix?

To be spec-compliant, I think that map<K,V> should be handled differently. With your ParentMap message, I would expect 0x0A0E0A07666F6F5F626172120362617A to be decoded like this (keys are only converted to strings, no case transformation):

{
  "myChild": {
    "foo_bar": "baz"
  }
}

PS: if I understand correctly, what your are currently doing is a post-processing of the deserialized message. So, it's probably difficult to reliably distinguish a map<K,V> from a message. Maybe you can find inspirations in the official Java implementation.

connesc avatar Nov 21 '17 08:11 connesc

Nightly should have no serious implications, a lot of real production Rust code uses nightly.

PS: if I understand correctly, what your are currently doing is a post-processing of the deserialized message.

Yes, so I think handling your map<K, V> case is something that should be modified in my upstream protobuf dependencies (either rust-protobuf or serde-protobuf).

In fact maybe even --canonical's post-processing behavior belongs in serde-protobuf.

sevagh avatar Nov 21 '17 16:11 sevagh

I totally agree, this should ideally be handled earlier in the deserialization process. However, can we expect these libraries to implement this in the short term?

connesc avatar Nov 21 '17 17:11 connesc

sevagh:pq $ ./tests/python/proto_encoder.py  single nested | ./target/debug/pq --fdsetfile ./tests/fdsets/parent_child_nested.fdset --msgtype Parent --canonical
{
  "myChild": {
    "fooBar": "baz"
  }
}
sevagh:pq $ ./tests/python/proto_encoder.py  single map | ./target/debug/pq --fdsetfile ./tests/fdsets/parent_child_map.fdset --msgtype ParentMap --canonical
{
  "myChild": {
    "foo_bar": "baz"
  }
}

sevagh avatar Nov 22 '17 11:11 sevagh

Some more thoughts on the upstream/canonical JSON handling question:

  • rust-protobuf provides the lowest level of protobuf code. It doesn't care about JSON
  • serde-protobuf deserializes protobuf into serde_value objects. It doesn't care about JSON. Google doesn't have any rules on what language-specific intermediate formats should look like.
  • It is my personal choice in pq to feed serde-protobuf's output serde_value objects into serde_json to have JSON as a final output

Therefore, it should be my responsibility to massage the deserialized input to output canonical JSON. pq promises protobuf -> json. My dependencies don't.

sevagh avatar Nov 22 '17 12:11 sevagh

Well, it makes sense, thanks for these explanations.

Your latest changes look great! Are you still able to distinguish these two cases?

syntax = "proto3";

message ParentMap {
        map<string, string> my_child = 1;
}
syntax = "proto3";

message KeyValue {
        string key = 1;
        string value = 2;
}

message ParentMap {
        repeated KeyValue my_child = 1;
}

The first should be converted to {"k": v, …}, but not the second.

connesc avatar Nov 22 '17 15:11 connesc

I'll whip up a test case. I didn't notice the "RepeatedField" which defeats my Seq scan expectation.

sevagh avatar Nov 22 '17 15:11 sevagh

It fails on this footgun. At this point distinguishing a RepeatedField from a real map is impossible. That one needs to be upstream.

sevagh avatar Nov 22 '17 15:11 sevagh

Interestingly, look at what I find when I peak into the FileDescriptorSet object:

fdset: file {
  name: "schemata/parent_child_nested.proto"
  message_type {
    name: "Parent"
    field {
      name: "my_child"
      number: 1
      label: LABEL_OPTIONAL
      type: TYPE_MESSAGE
      type_name: ".Child"
      json_name: "myChild"
    }
  }
  message_type {
    name: "Child"
    field {
      name: "foo_bar"
      number: 1
      label: LABEL_OPTIONAL
      type: TYPE_STRING
      json_name: "fooBar"
    }
  }
  syntax: "proto3"
fdset: file {
  name: "schemata/parent_child_map.proto"
  message_type {
    name: "ParentMap"
    field {
      name: "my_child"
      number: 1
      label: LABEL_REPEATED
      type: TYPE_MESSAGE
      type_name: ".ParentMap.MyChildEntry"
      json_name: "myChild"
    }
    nested_type {
      name: "MyChildEntry"
      field {
        name: "key"
        number: 1
        label: LABEL_OPTIONAL
        type: TYPE_STRING
        json_name: "key"
      }
      field {
        name: "value"
        number: 2
        label: LABEL_OPTIONAL
        type: TYPE_STRING
        json_name: "value"
      }
      options {
        map_entry: true
      }
    }
  }
  message_type {
    name: "KeyValue"
    field {
      name: "key"
      number: 1
      label: LABEL_OPTIONAL
      type: TYPE_STRING
      json_name: "key"
    }
    field {
      name: "value"
      number: 2
      label: LABEL_OPTIONAL
      type: TYPE_STRING
      json_name: "value"
    }
  }
  message_type {
    name: "ParentMapFootgun"
    field {
      name: "my_child"
      number: 1
      label: LABEL_REPEATED
      type: TYPE_MESSAGE
      type_name: ".KeyValue"
      json_name: "myChild"
    }
  }
  syntax: "proto3"
}

There is a hint in there, json_name. I should find a way to use this instead of black magic.

sevagh avatar Nov 22 '17 15:11 sevagh

This code will take some time to think about and implement correctly. I'll be MIA for the next month so it will take time!

sevagh avatar Nov 22 '17 15:11 sevagh

Well spoted!

I agree that using the json_name is probably better than black magic. However, I'm not sure it's a good idea to parse and apply the fdset at multiple places.

Don't worry, no urgence at all :wink:

connesc avatar Nov 22 '17 17:11 connesc

Thanks for contributing BTW. Hope we can find a solution to this eventually.

sevagh avatar Nov 22 '17 18:11 sevagh

Alright, @connesc - I'm back and ready to work on this again but I'm now realizing I don't like the solution I implemented. I liked pq the way it was - just chaining some command-line options and existing protobuf/json parsers.

I intend to clean up and remove the half-solutions I implemented. In the meanwhile, I'll see if I need to go upstream with this JSON canonicalization problem. Even better, I've been intending to write my own protobuf-to-json piece. Maybe now's the time to start on it.

sevagh avatar Dec 31 '17 00:12 sevagh