param Full JSON/YAML support for Param

Full JSON/YAML support for Param

Open jbednar opened this issue 2 years ago • 32 comments

Param's JSON support is currently quite limited. It can serialize and deserialize all the basic Parameter types, including types matching the native JSON/JS types (Integer, Float, String, etc.), containers of such types (List, Dict, etc.), or Python/Param-specific types that need to be stored in a special format and then decoded back into Python types (Tuple, Array, DataFrame, etc.). However, there is not currently any support for storing or restoring:

the Parameterized type of the object being serialized (which the deserialization code currently needs to know beforehand)
nested Parameterized objects (Parameter values that are themselves Parameterized objects)
directly to or from YAML (which is a superset of JSON)

These limitations prevent this form of text serialization from being used as a general-purpose human-readable text format for specifying collections of Parameterized objects. Ideally, we'd be able to serialize a large and comprehensive class of objects to pass around as JSON (in a server context), and would be able to accept a hierarchical YAML file instantiating various Parameterized classes, each of which may contain arbitrarily nested other Parameterized objects also specified. Of course, JSON itself was not designed to represent Parameterized objects or even Python objects, nor does it include explicit mechanisms for extending the language (unlike XML). Possible ways around this:

YAML does include custom-type support using !, and one approach is to support additional functionality when using YAML only (as described for e.g. pyyaml) that accepts a specified custom type and instantiates it. E.g.:

someobj:
  !custompackage.custommodule.CustomType 
    name: "My object"
    size: 7
    nestedobj:
      !custompackage.custommodule.CustomType2:
        name: "Nested object"
        desc: "str"

Establish a convention using JSON (and thus accepted in YAML), e.g. using a dict containing a special keyword like __type that you hope is unlikely to occur by chance. For example, in JSON:

{"someobj": 
    {"name":"My object", "__type":"custompackage.custommodule.CustomType", "size": 7, "nestedobj": 
        {"name": "Nested object", "__type":"custompackage.custommodule.CustomType2", "desc": "str""}}}

or in YAML:

someobj: 
  name: "My object"
  __type: "custompackage.custommodule.CustomType"
  size: 7
  nestedobj:
    name: "Nested object"
    __type: "custompackage.custommodule.CustomType2"
    desc: "str"

(I could easily have gotten any of the syntax here confused, so please update it if you spot an error!)

I think the pros of option 1 are that it fully follows the YAML spec and is thus well defined with no reliance on heuristics, and that it's clear to a non-Python recipient of such a file that it requires some custom handling to parse the ! entries. The cons are that it won't be supported for JSON output.

The pros of option 2 are that it's supported for both JSON and YAML equivalently, and the cons would be that it relies on a heuristic that won't necessarily be evident to some non-Python code that tries to parse such objects, potentially leading to undetected errors and missing functionality.

Anyone have an option 3? I think we should go ahead with some option, in any case.

Sep 01 '21 18:09 jbednar

This issue relates to a variety of previous issues and PRs that are worth reading before making decisions:

Current JSON support implementation: #414
Request for serialization to text files: #485
Earlier proposal for YAML support, written before the current JSON support was added (with extensive proposed YAML example syntax alternatives): #159
Earlier proposal that led to the current JSON literal serialization support: #382
Support for deserializing from a text file: #482 (open PR)
Earlier PR for JSON/YAML support, not merged: #153

Sep 01 '21 18:09 jbednar

Hello, occasional busybody here.

If the first concern can be addressed (determining the outermost parameterized type), the second concern shouldn't be an issue, right? You should be able to infer the appropriate parameterized class from the outermost class' member.

Personally I have very little stake in the matter because I can't imagine a scenario where I need to deserialize something which I know how to manipulate but not the class it belongs to. That said, I would vote for option 2 because a) there's really no other way around JSON; and b) option 1 will fail if the class is not available (no-one outside of the param ecosystem can manipulate it).

Thanks for your time, Sean

Sep 08 '21 21:09 sdrobert

If the first concern can be addressed (determining the outermost parameterized type), the second concern shouldn't be an issue, right? You should be able to infer the appropriate parameterized class from the outermost class' member.

Right; those first two concerns should be solved at once, if the type is stored along with the set of parameters; the same representation should work both for the outermost class and for nested classes.

Personally I have very little stake in the matter because I can't imagine a scenario where I need to deserialize something which I know how to manipulate but not the class it belongs to.

There are a lot of places this comes up:

Even if exporting and restoring only a single class, having that class declared can be an important validation, and allows detecting bogus files provided as arguments.
Even if you already know the class of what you are getting, users should be allowed to provide a subclass. E.g. you could always expect a Simulation, but you might be provided an EventDrivenSimulation, a TimebasedSimulation, etc., and you'd want to instantiate the correct class in each case so that you can handle all the parameters associated with the subclasses, not just the superclass.
Even if you know precisely which class you should be getting, if any of the parameters of that class can themselves be a Parameterized object (e.g. a Scene that might contain a Box with a Shape inside), you have to know the classes for all the subobjects in order to instantiate any of it.

That said, I would vote for option 2 because a) there's really no other way around JSON; and b) option 1 will fail if the class is not available (no-one outside of the param ecosystem can manipulate it).

I agree, though I would imagine that e.g. JavaScript could define some sort of handler for !!python objects...

Sep 14 '21 20:09 jbednar

I definitely do not like any approach that relies on YAML features not supported by JSON. Out of the two options presented above, I prefer the second.

That said, sticking the qualified path to the parameterized into the dictionary of parameters is only one approach. What is necessary is to associate the collection of parameters to the class and this could be done in two other obvious ways that map to JSON:

A tuple. The simplest case would be something like ('custompackage.custommodule.CustomType', {'size':7})
An outer dictionary which would be something like {'parameterized':'custompackage.custommodule.CustomType', 'parameters':{'size':7}}

Here I prefer the second option as the tuple is shorter but needlessly encodes semantics by position. In addition, I would want to support an extension of the second option to allow args and kwargs so you can reconstruct the object:

{'parameterized':'custompackage.custommodule.CustomType', 'parameters':{'size':7}, 'args':[33, 45], 'kwargs':{'foo':99}}

Of course supporting args and kwargs would need a mechanism that hooks into param but is supplied by the class (e.g. with some special method).

Note that we now have three different possible JSON schemas for our object depending on which approach is chosen. I think this explains why the current schema method needs some outer padding to a 'full' schema: the form of that outer padding depends on the convention used. I am in favor of my last suggestion as you no longer have any possibility of a namespace clash, it is explicit and it is extensible.

Sep 14 '21 20:09 jlstevens

@jlstevens , how does your proposed tuple support nested objects? E.g. if the value of a param.Parameter is ('custompackage.custommodule.CustomType', {'size':7}), how can we know that should be instantiated as CustomType(size=7), as opposed to the literal tuple (str, dict) specified? Having a heuristic special value like __type (or some even more obscure name) is designed to make it possible to detect that it's not simply an ordinary literal. Presumably for your dict proposal, 'parameterized' could be '__parameterized' or something obscure that achieves the same thing. I guess for the tuple it could be ('__parameterized', 'custompackage.custommodule.CustomType', {'size':7})?

With any of these approaches, I think it's very important to consider how it would be expressed in YAML. One of the main uses for such serialization is for a human being to be able to write a reasonable, clear, concise, and compact YAML specification, and I'd like to avoid adding extra levels of nesting that are not typically needed. I agree it's nice to have a place for args and kwargs, but I'm not sure that justifies an extra nesting level as I think there are other ways to handle those.

Sep 14 '21 21:09 jbednar

Because JSON doesn't have the concept of classes, all of these proposals will always be heuristics. For every scheme above there is an object that will be understood to be a class by param when in fact it isn't.

Your question indicates another reason I prefer the dictionary format to the tuple format. If a dictionary has a key called parameterized and that fully qualified path points to a real parameterized class in the namespace, then the intention is pretty likely to be that it should be turned into that parameterized object. Of course that might not be the case, but it seems unlikely that you would stumble on that by accident (as unlikely as using __type name for instance and I don't think __parameterized helps much here).

With any of these approaches, I think it's very important to consider how it would be expressed in YAML.

Agreed which is why I would like to avoid the horrible names starting with one or more underscores.

I agree it's nice to have a place for args and kwargs, but I'm not sure that justifies an extra nesting level as I think there are other ways to handle those.

I think the extra level nesting is better (clear and explicit) even ignoring args and kwargs. That those could also be supported is a nice additional bonus imho.

Sep 14 '21 21:09 jlstevens

Note the args and kwargs options would be optional even if supported. An object's YAML could be something like:

parameterized: "custompackage.custommodule.CustomType"
parameters:
  name: 'my-object'
  size: 7

In practice this would typically be put under a key in YAML but that is not part of the yaml param needs to process to build the object (and not the YAML param would necessarily generate for you):

someobj:
  parameterized: "custompackage.custommodule.CustomType"
  parameters:
    name: 'my-object'
    size: 7

I considered whether the name parameter could be used for the outer key (dropping it from the parameter list) but thinking about this more, this approach won't really work consistently for nested parameters where the parameter name may have no relation to the name of the nested parameterized object. In other words, the name someobj has to come from the user and is not something param can generate for you automatically.

I also think this is a reasonable amount of nesting: deeply nested parameterized objects are going to be horrible no matter what so I think we should not consider what nesting looks like too much and only make sure it would be supported.

Sep 14 '21 22:09 jlstevens

For a nested Parameterized, the someobj name would be the name of the Parameter whose value is a Parameterized, e.g. nestedobj above. We could also require such a name at the top level, but at that point it just declares the Python attribute name of the whole object you're returning, and usually you wouldn't want to put some arbitrary object in the global or some other namespace (e.g. topo.sim:); instead the reader would know where they want the result to go already and only wants the value, not the name. So I don't think it makes sense to require a name at the top level, as it will just be ignored, right?

I don't think it's safe to ignore nested objects, because the main reason to use Param for a configuration file compared to e.g. TOML or INI is the power of compositionality. For your proposal I think the nesting would look like:

someobj:
  parameterized: "custompackage.custommodule.CustomType"
  parameters:
    name: 'my-object'
    size: 7
    nestedobj: 
      name: "Nested object"
      parameterized: "custompackage.custommodule.CustomType2"
      parameters:
        desc: "str"

Sep 14 '21 23:09 jbednar

... instead the reader would know where they want the result to go already and only wants the value, not the name.

Agreed. I'm just saying that the serialization will be in some context that param doesn't know about but the user does. The output will be under some sort of key in some sort of namespace of which the global namespace is the simplest option (and I agree, doing this automatically won't be that useful in general).

Your example of what I am proposing for nested objects almost looks correct and imho it isn't at all bad! The nice thing about YAML (like Python) is that indented blocks are very easy to identify and isolate visually.

The reason I say 'almost' is that I don't see why the name key for nestedobj (value of "Nested object") shouldn't be under the parameters list like all the other parameters? I think it should be as follows to be consistent:

someobj:
  parameterized: "custompackage.custommodule.CustomType"
  parameters:
    name: 'my-object'
    size: 7
    nestedobj: 
      parameterized: "custompackage.custommodule.CustomType2"
      parameters:
        name: "Nested object"
        desc: "A description of some sort"

Here the long parameterized keys (and potentially long qualified namespaces) also help to visually identify the nested objects. The class also jumps out at you so you can easily mentally switch your mental model to the type of class you are now talking about. I rather like it!

Sep 15 '21 09:09 jlstevens

I think the name should be under parameters 1) It is easier to read and 2) name as some kind of special/ reserved parameter with special behavior that is always there is not a good idea for all use cases. In my experience its problematic.

Sep 15 '21 12:09 MarcSkovMadsen

I agree, name should have been in the parameters list for the subobject same as it is for the outermost object.

Sep 15 '21 14:09 jbednar

The main problem I see with proposals as laid out above is that object references are not handled, i.e. if two objects reference the same subobject there is no way to express that given the schema laid out above. This can only really be cleanly handled by never inlining a subobject and instead keep all parameterized objects indexed by some custom id and then having some syntax to resolve those references, e.g. here's what such a schema might look like:

<parameterized1>:
  type: "custompackage.custommodule.CustomType"
  parameters:
    name: 'my-object'
    size: 7
    nestedobj: ref(<parameterized2>)
<parameterized2>: 
  type: "custompackage.custommodule.CustomType2"
  parameters:
    name: "Nested object"
    desc: "str"

Sep 20 '21 15:09 philippjfr

Good point; referencing subobjects is important. Still, why would you say "never" inlining a subobject? I would think there are very many cases where subobjects are only needed by their immediate owner, and having to create a separate object and have someone track it down in the file seems confusing and like a hassle for many use cases.

Or did you mean that automatic serialization would always create such separate objects, to make the implementation easier, while a user hand-writing their own .yml file wouldn't have to do that?

Sep 20 '21 15:09 jbednar

Or did you mean that automatic serialization would always create such separate objects, to make the implementation easier, while a user hand-writing their own .yml file wouldn't have to do that?

Yes that seems appropriate to me, for the automated case it just seems weird to have it be inconsistent. That said I wouldn't oppose an option that enables the inlining. I guess my only other question is what the point of the top-level key in the original schema was, i.e. where did someobj come from? You obviously need such a key for the references but in their absence I don't understand it.

Sep 20 '21 15:09 philippjfr

Right; see "We could also require such a name at the top level," above, where I point out that the top-level someobj name serves no current purpose besides consistency for how we specify subobjects. But you're right that it could serve a referencing purpose, which is great!

BTW, the name parameters generated by Param are already globally unique (Parameterizedclassname+objectinstantiationcounter) by default. I wonder if it makes sense to use them for such references, though that's problematic because users can override the names to anything they like, and we'd need some unique name to use in that case. Maybe use the name parameter for the reference by default, but then assign a unique id to anything clashing?

In any case, when a user is handwriting files, forcing them to make a globally unique identifier for each subobject is a burden I'd rather avoid, while inlining them avoids that (at the cost of making references impossible). If we could refer to nested objects by relative or absolute paths, then only the path needs to be unique, which is much easier for a human to ensure, but that seems like a lot of structure to be trying to build on top of YAML/JSON.

Sep 20 '21 15:09 jbednar

Maybe use the name parameter for the reference by default, but then assign a unique id to anything clashing?

You don't need to resort to a uuid or something equally unreadable. It could be also be 'name'+number to disambiguate...

I would prefer to inline unless references are actually needed but that does complicate the implementation and format specification.

Sep 20 '21 16:09 jlstevens

Regarding the name as a uuid. For me it’s been a mental challenge all the way of using param that

name is a special parameter on a parameterized class making it difficult to use name in cases where name is a natural part of the mod
Name is often unique. But does not have to be.
I’ve experienced sometimes that I cannot specify a name on instatiation of some Parameterized object. My argument is simply not used.
Maybe it’s documented now. But I’ve never seen this behavior explained.

Sep 21 '21 02:09 MarcSkovMadsen

I agree that the name parameter has always been a special case which can be confusing.

That said, when I mentioned the uuid (and not wanting to use that) I was talking about the reference key in the yaml file (which I am saying could use the name but make sure it is unique). Even with the issues you point out with the auto-generated names, I still think it would be easier for a human to use than a UUID.

Sep 21 '21 10:09 jlstevens

I’ve experienced sometimes that I cannot specify a name on instantiation of some Parameterized object. My argument is simply not used.

If you catch that happening, please file an issue; I don't think there is any valid scenario where the name would be ignored.

Sep 21 '21 15:09 jbednar

If you catch that happening, please file an issue; I don't think there is any valid scenario where the name would be ignored.

This would likely be because some subclass isn't appropriately passing through name to the super call.

Sep 21 '21 17:09 philippjfr

For what it's worth, I didn't find any Parameters treating name specially on a quick browse through the source code.

Sep 21 '21 17:09 jbednar

I think @philippjfr means that user code defining parameterized classes could fail to pass name to the super call, either accidentally or for some other reason (i.e. on purpose). This would be about how param is used and not something in param itself.

Sep 21 '21 17:09 jlstevens

Right; I looked at each of the __init__ methods in param/__init__.py, and (a) each of them calls the init method on their superclass, (b) none of them appears to handle name specially, and (c) all of them pass **kw (by a variety of names) up to the superclass. So each of them should be processing name precisely the same as all the rest, as far as I can see. So if there's a problem with name on some Parameter subclass, it's very mysterious!

ETA, oh, from your edit, you mean a Parameterized class could fail to call super, not a Parameter class? If so, maybe, but that would presumably cause all sorts of other problems, and wouldn't be an issue with Param itself. Sorry for the confusion; for some reason I got sidetracked looking at Parameter names, but the issue here is with Parameterized names. Issues with Parameterized names would be due to the person who wrote the Parameterized class, not to do with Param itself, since Param provides only three Parameterized classes, all of which handle name consistently.

Sep 21 '21 18:09 jbednar

... you mean a Parameterized class could fail to call super, not a Parameter class?

Right.

... but that would presumably cause all sorts of other problems, and wouldn't be an issue with Param itself.

It may not cause any problem other than name not being respected and maybe there is a good reason the subclass does not want to expose that parameter. I agree it wouldn't be an issue with Param itself.

Sep 21 '21 18:09 jlstevens

+1 to this topic overall, and +1 to @philippjfr proposed solution for nested object references. I want to convert lists of parameterized objects into pandas dataframes of their respective serialized data.

Mar 24 '22 05:03 LinuxIsCool

Sounds good! Anyone want to volunteer to start the first pass at implementing this? I think it's pretty well constrained by the above discussion, but would need to be typed up and then have a lot of testing.

Mar 25 '22 01:03 jbednar

On a related note, I'm wondering how amenable the team would be to a non-default, "reckless" mode of JSON serialization/deserialization which ignores the above and just goes by best-guess -- that is, the class passed to the ClassSelector is exactly the class which would be instantiated upon deserialization. Here's a gist of how it could look. The benefit of having this alongside what's being proposed here is that the output would be much simpler and easier to read in the case where all this reference passing and subclass stuff.

Sep 07 '22 20:09 sdrobert

@jbednar I've implemented YAML serialization by splitting up the functionality of JsonSerialization in pydrobert-param (relevant code here). Please let me know if you'd like this in a PR.

I've also implemented "reckless" versions of both JSON and YAML serialization.

Sep 21 '22 18:09 sdrobert

On a related note, I'm wondering how amenable the team would be to a non-default, "reckless" mode of JSON serialization/deserialization which ignores the above and just goes by best-guess -- that is, the class passed to the ClassSelector is exactly the class which would be instantiated upon deserialization. Here's a gist of how it could look. The benefit of having this alongside what's being proposed here is that the output would be much simpler and easier to read in the case where all this reference passing and subclass stuff.

We would definitely like to have nested YAML and JSON de/serialization, but could you explain more what you mean by "reckless"? Your description of subclass handling sounds precisely like what's described above, i.e. that the serialized file encodes the type of the Parameterized class, and the deserialized class is the same type. The discussion above is about ways to achieve preserving the type, not whether to preserve it.

References, on the other hand, do need to be handled by any general serialization implementation, or else any Parameterized that somewhere inside it has Parameters both set to the same Parameterized will deserialize to two copies of that object, while the original has only one object, leading to very serious and subtle bugs. Plus it will just crash if there are any circular references, with a Parameterized having a Parameter that points to some object that has a Parameter pointing back to the original.

If you are using serialization as a configuration file format rather than a persistence mechanism then the reference issue is less serious, but it still means that the configuration file cannot express shared nested subobjects or circular references.

In any case, I'd love to see this support as a PR, particularly if it comes with example configuration files that deserialize and then serialize back to essentially the same form. Sorry that we haven't addressed this already!

May 18 '23 02:05 jbednar

Hi @jbednar.

The "reckless" mode is indeed reckless because it does neither of the things desired above. In other words, it a) tries to (de)serialize exactly the class specified in the ClassSelector definition; and b) does not handle references (nested instances are considered distinct). The details are listed in the note here. As you say, for configuration files it isn't much of an issue, which is why I use it.

Since it's an opt-in strategy of (de)serialization, I don't think its inclusion harms anything. Furthermore, it can be paired with whatever referencing strategy is decided upon at a later date: the only cases it makes a reckless decision for are those which would normally fail.

Regardless, the YAML support I propose mirrors the JSON support. I was planning to make these PRs separately. Which, if any, do the team want?

May 18 '23 21:05 sdrobert

param param copied to clipboard

Full JSON/YAML support for Param

param
param copied to clipboard