jsonm icon indicating copy to clipboard operation
jsonm copied to clipboard

[RFC] Jsonm.Value: encoding and decoding complete values

Open gasche opened this issue 4 years ago • 14 comments

This is a Request For Comment rather than a complete pull request, although the implementation is sensible and I welcome review comments.

Jsonm is a streaming library that provides a linear view of a JSON document. The point of view of the current documentation is that encoding and decoding tree-representations of JSON values (which a large share of JSON users actually need/want) is left as an easy exercise for users, and given as an example in the documentation. However, this approach has the following downsides:

  1. The example in the documentation has bad error-handling, and users who naively reuse it get bad error-handling in their code. It requires some work to understand how to handle errors properly, and this work was duplicated many times.
  2. Support libraries pop up to implement JSON tree representations on top of Jsonm, Ezjsonm being the most popular (it has more dependents than Jsonm itself on OPAM), json_of_jsonm appears to be an unmaintained experiment. Guess what: their error-handling is rather bad.

I did the work to convert an Ezjsonm user to Jsonm to get proper error locations in https://github.com/rgrinberg/ocaml-mustache/pull/63 , and I would like to avoid this work to other users. I am planning to contribute better error-handling to Ezjsonm, but I think that the best approach for everyone would be to have proper tree-representation support as a convenience in Jsonm, well-integrated with its API.

The current PR is a proposal for such an integration of Json "values" (tree representations) in Jsonm. It uses the value type coming from the declaration example, but provides proper error handling and supports the Manual encoder/decoders. I believe that the API proposal and the implementation are solid, and there is some testing (no testing of Manual encoders/decoders, though), but I suspect that @dbuenzli would want to see improvements to the documentation. Given that there is a good chance that @dbuenzli would not be interested in the feature at all, I am submitting the PR in this current state (documentation not completely finished) to get a conversation started.

gasche avatar Jan 11 '21 09:01 gasche

I think integrating that in Jsonm is reasonable however one question needs answering.

The proposed representation will only improve generic JSON parsing error messages but in practice generic JSON is worthless. You always interchange JSON data according to a given schema.

Now most people will not bother using the streaming interface to parse schemas because it's painful (but useful in extreme cases, e.g. if you need to parse gigabytes of GeoJSON data). So they will use the value interface.

However the value interface you propose doesn't allow to report good error messages for schema errors -- like that object is missing a bla member or the member bli of that sub-object on that line is supposed to be a string not a number :-)

Wouldn't it make sense to have Value.t tuple precise locations of values ? Basically I think we should rather move towards Serialk's JSON representation which has locations attached to nodes -- note that the latter representation will change as I need more metadata so it's not about making it compatible with it.

If people complain that this is inefficient and costly, first scream at them that computers are for humans not for machines and then you can always provide them with a toggle that plugs nil location values if they want to suffer bad error messages.

dbuenzli avatar Jan 11 '21 10:01 dbuenzli

That's a good point, we could have a decoder that has locations for each value node. But it requires additional API design.

Maybe the value type could be parametrized over the type of location metadata, and in particular the encoder would be polymorphic over this type as it ignores locations?

Regarding locations: when I worked on better error-reporting in ocaml-mustache I found myself missing the obvious "type of source locations and functions for user-friendly formatting in messages" logic as an independent opam package. (I ended up reimplementing that for ocaml-mustache, but I would still prefer to share code.)

gasche avatar Jan 11 '21 10:01 gasche

Another remark: json_of_jsonm tries hard to integrate with monadic concurrency frameworks by providing versions abstracted over an IO monad (json_of_jsonm_monadic.mli; binds are inserted precisely in the `Await cases, here in decode and here in encode). I hope that the API proposed in this PR is flexible enough to let users rebuild this integration on top of it, but I have not checked it by actually writing the code -- this could be added as an test/example.

gasche avatar Jan 11 '21 11:01 gasche

Maybe the value type could be parametrized over the type of location metadata, and in particular the encoder would be polymorphic over this type as it ignores locations?

That would work for me I think.

Regarding locations: when I worked on better error-reporting in ocaml-mustache I found myself missing the obvious "type of source locations and functions for user-friendly formatting in messages" logic as an independent opam package. (I ended up reimplementing that for ocaml-mustache, but I would still prefer to share code.)

I think the best way to tackle this problem is to liberate the programmer from that duty and let the API solve it by itself.

I think `Serialk's query combinators manage to do this (including things like spell-checking dictionary keys etc.). I use it with s-expressions in brzo and it seems to work reasonably well to my taste -- it also has a lot of other good properties like support for the problem of layout preserving updates and allows partial format modelling (which is nice when you quickly want to query/update formats that you don't want to fully model).

I'd be curious on how it fares on your hairy example. Unfortunately right now the JSON parser needs to be rewritten to be location aware and the JSON query api needs to be put on par with the sexp one.

dbuenzli avatar Jan 11 '21 13:01 dbuenzli

I pushed another commit with the following interface:

type pos = int * int
type range = pos * pos

  type 'loc t = 'loc value * 'loc
  and 'loc value = [
  | `Null
  | `Bool of bool
  | `Float of float
  | `String of string
  | `A of 'loc t list
  | `O of ((string * 'loc) * 'loc t) list
  ]

  val decode : decoder -> range decode_result
  val encode : encoder -> 'loc t -> encode_result

Another interface I considered is to have two separate value types, one with locations and one without, with a function to erase locations (going from one to the other). decode produces a value with locations, and encode takes a value without locations. The benefit of this is that the without-locations version is much closer to the type currently used in Jsonm libraries that build trees (they often imitate your documentation example), so it would make for easier adoption on their side. The downside is that I find the naming a bit awkward: t and t_noloc? Ergh.

gasche avatar Jan 11 '21 20:01 gasche

I pushed another commit with the alternative interface I mentioned previously:

  module With_loc : sig
    type t = value * range
    and value = [
    | `Null
    | `Bool of bool
    | `Float of float
    | `String of string
    | `A of t list
    | `O of ((string * range) * t) list
    ]
  end

  type decode_result = [ `Value of With_loc.t | ... ]
  val decode : decoder -> decode_result

  module Without_loc : sig
    type t = value
    and value = [
    | `Null
    | `Bool of bool
    | `Float of float
    | `String of string
    | `A of t list
    | `O of (string * t) list
    ]
  end

  val erase_locs : With_loc.t -> Without_loc.t

  type encode_result = [ ... ]
  val encode : encoder -> Without_loc.t -> encode_result

I would be interested in knowing whether you have a preference. Personally I find that 'loc t is a slightly nicer design (it's nice that there is no need to erase locations before encoding back), but it's more complex and I think that in practice users will find it less useful, because it makes constructing trees without locations slightly more cumbersome. The new API has the following advantages:

  • it makes it easy for people that use Ezjsonm to call those functions directly if they need/want to
  • people who programmatically build JSON trees (without locations) have first-class support in the API

At the same time it avoids the defect of making it too easy to decode JSON trees without source locations (the "path of least effort" remains the better one, with locations).

Note: the naming of locations is not optimal, I use range for the type of character ranges to follow the existing choices, but I use loc in With{,out}_loc and erase_locs, because I don't think that With_ranges or erase_ranges would work. Are we okay with the discrepancy, should loc be used everywhere, do you have some other proposal?

gasche avatar Jan 12 '21 08:01 gasche

So here's what I think what I would do:

  1. No with/without_locs. That's needless bureaucracy and choice.
  2. Locations inside the cases. People who need more efficiency can move on to direct lexeme decoding or can try to work on the library to see if a layer of abstraction can be unpeeled to use the low-level decoding functions (which could likely support both styles by providing a function to construct the result of a decode).

I'm not too concerned about constructing values with dummy locations (just provide a constant for nil locations). This is a few combinators away if deemed too painful (see e.g. the API here). And in any case it's also usually not such a good idea to translate datastructures first to a generic JSON representation and then serialize them.

Also there's another thing that bugs me. I wish I knew better when I did this but the line/column positions (line number + Unicode scalar value on the line) strategy works nowhere. Nowadays I think that given the state of editors this is the right strategy both for humans and machines (e.g. to perform surgery if you still have the original input) but it's costly - 6 numbers to denote an input range, though line positions can be shared. (The underlying uutf decoder has the information)

That being said, all together I'm not sure whether we are not beating a dead horse here. Jsonm also supports (and pays for) things that are no longer needed by RFC8259 like UTF-16 decoding and maybe one day we can stop performing the CPS monkey dance at which point the interface and implementation will likely look sufficiently different to warrant a new library.

Personally for dealing with generic representations of JSON I nowadays rather use Serialk which I hope to bring to release at some point. I would turn to Jsonm only for the streaming lexemes support.

dbuenzli avatar Jan 12 '21 10:01 dbuenzli

The Tloc module looks very nice. Could it be released as an independent library?

(One thing that I think would be nice for third-party users would be to have functions from Lexing.position to your representation, with the understanding that they produce imperfect information as the input format has less information.)

gasche avatar Jan 12 '21 10:01 gasche

Regarding the dead horses all together: I'm fine with closing this PR. The next step in any case is to send a PR to Ezjsonm to report decoding-error locations (it's easy to do, but less ambitious as it does not provide locations inside values, which would also be beneficial.) Before that I may try at last attempt among the lines you suggested, because I think that (if we can get something you're okay with) it would still provide value to Ezjsonm and Jsonm users out there; as always there is considerable inertia so I don't see them moving to Serialk right away.

gasche avatar Jan 12 '21 10:01 gasche

because I think that (if we can get something you're okay with) it would still provide value to Ezjsonm and Jsonm users out there; as always there is considerable inertia so I don't see them moving to Serialk right away.

Yes sure. Just make the simple thing and I guess we can just merge that. Also forget about the location stuff, the lines are correct and the columns are on US-ASCII data which should be good enough®.

Also in Serialk for now I don't plan to have sophisticated incremental and non-blocking input strategies, I always work on complete strings which may put people off.

dbuenzli avatar Jan 12 '21 11:01 dbuenzli

I pushed a version with locations inside each constructor.

  type 'loc t = [
  | `Null of 'loc
  | `Bool of bool * 'loc
  | `Float of float * 'loc
  | `String of string * 'loc
  | `A of 'loc t list * 'loc
  | `O of ((string * 'loc) * 'loc t) list * 'loc
  ]
  (** The type of JSON trees is parametrized over the type 'loc
      representing source locations. In particular you can use
      [unit t] to represent JSON values without position information,
      and our decoders return [range t], pairing nodes in the tree
      with the corresponding character range in the input source. *)

  type 'loc decode_result = [ `Value of 'loc t | ... ]
  val decode : decoder -> range decode_result

  type encode_result = [ `Ok | ... ]
  val encode : encoder -> 'loc t -> encode_result

I you like it enough, I can do a pass of documentation polishing for the new stuff.

gasche avatar Jan 12 '21 22:01 gasche

(This is a gentle ping: it's completely fine if you are busy with other things, but if you are just procrastinating about saying "no", please do not hesitate to do so.)

gasche avatar Jan 15 '21 13:01 gasche

(Note: this PR is still on my TODO list, but I have been too busy with other things before it in the list for now. Apologies for the delay.)

gasche avatar Feb 28 '21 16:02 gasche

I'm not in a hurry.

dbuenzli avatar Feb 28 '21 18:02 dbuenzli