Maybe explore how the file serialization format could include additional data such as a version number

Open mhucka opened this issue 2 years ago • 10 comments

Is your design idea/issue related to a use case or problem? Please describe.

Cirq currently provides methods to serialize circuits to/from a JSON format. However, Cirq's current serialization methods do not appear to record a version number or schema identification information in the JSON output: they only read/write the data structures. If this JSON content is stored in files, and the details of Cirq's JSON serialization format change in the future (something that is likely as Cirq evolves), then it may become difficult to determine which format is assumed by a given file, or worse, future versions of Cirq may throw errors on perfectly valid but old JSON format files. The undesirability of this has already been expressed by @Strilanc in a comment on issue #4321.

Describe your design idea/issue

The serialization format could include a version number or a schema URI. Since space taken up by the serialized representation and the speed of serialization are probably concerns, just a version number is probably the best compromise. For example,

{
  "_version": 1,
  ...
}
Longer example:
{
  "_version": 1,
  "cirq_type": "Circuit",
  "moments": [
    {
      "cirq_type": "Moment",
      "operations": [
        {
          "cirq_type": "GateOperation",
          "gate": {
            "cirq_type": "HPowGate",
            "exponent": 1.0,
            "global_shift": 0.0
          },
          "qubits": [
            {
              "cirq_type": "LineQubit",
              "x": 0
            }
          ]
        }
      ]
    },
    {
      "cirq_type": "Moment",
      "operations": [
        {
          "cirq_type": "GateOperation",
          "gate": {
            "cirq_type": "CXPowGate",
            "exponent": 1.0,
            "global_shift": 0.0
          },
          "qubits": [
            {
              "cirq_type": "LineQubit",
              "x": 0
            },
            {
              "cirq_type": "LineQubit",
              "x": 1
            }
          ]
        }
      ]
    }
  ],
  "device": {
    "cirq_type": "_UnconstrainedDevice"
  }
}

(One could perhaps shorten the field name to _v to save space, although it's not clear that it would be worth it, since the size of the rest of the JSON content is probably much more than the 6 characters "ersion".)

The reading and writing of the version number could be completely encapsulated in the Cirq methods that do the serialization – it's not something that user code would deal with. Moreover, if/when the format ever changes in the future, the serialization methods in Cirq could detect which version was in use and switch to the appropriate legacy parser as needed.

Given that existing files do not have version numbers, IMHO the best way to handle a transition to adding version numbers would be to introduce the format change as part of a major release of Cirq (say, version 1.0). All JSON produced by the new release would include version numbers; all existing files that lack a version number would automatically be assumed to use the pre-version 1.0 format.
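
As a rough illustration of the transition idea above, here is a minimal sketch of a reader that treats a missing version field as the pre-1.0 format and dispatches to a matching parser. The `_version` key comes from the example above; `read_versioned`, `_parse_legacy`, `_parse_v1`, and `LEGACY_VERSION` are hypothetical names, not part of Cirq, and the sketch assumes the top-level JSON value is an object.

```python
import json
from typing import Any, Callable, Dict

LEGACY_VERSION = 0  # assumed label for files written before version numbers existed


def _parse_legacy(data: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical handler for the unversioned (pre-1.0) format.
    return {"format": "legacy", "payload": data}


def _parse_v1(data: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical handler for format version 1.
    return {"format": "v1", "payload": data}


PARSERS: Dict[int, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    LEGACY_VERSION: _parse_legacy,
    1: _parse_v1,
}


def read_versioned(text: str) -> Dict[str, Any]:
    data = json.loads(text)
    # A missing "_version" key means the content predates versioning.
    version = data.pop("_version", LEGACY_VERSION)
    try:
        parser = PARSERS[version]
    except KeyError:
        raise ValueError(f"Unsupported serialization version: {version!r}") from None
    return parser(data)
```

User code would only ever call the reader; the version lookup stays inside the serialization layer.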

Finally, having version information embedded in the JSON format would allow the future development of utility programs that can convert JSON between versions. This probably isn't needed right away, but it's another example of what is facilitated by embedding version information in the serialized data.

Potentially related issues:

  • #4321
  • #3438
  • #3539
  • #5264
  • #4318

mhucka avatar May 21 '22 16:05 mhucka

Cirq cync discussion:

  • With version numbers, one can interpret the same field with different semantic meanings, for instance.
  • Cirq 1.0 is a good time to do this.
  • Should version 2.0 parse version 1.0 JSON files? There are several options; one is to allow different parsers per version.
  • Downstream users can also use this version for verification or parsing.
  • Size of the JSON is a concern, i.e., put the version string at "file level". Note: Cirq should add this automatically.

Decision: tentatively marked accepted as part of cirq-cync but would like input from @mpharrigan and @MichaelBroughton

dstrain115 avatar May 25 '22 18:05 dstrain115

I think this needs more details (questions to follow); but this sounds weird. Between the cirq_type and the attribute names and types, you should always be able to disambiguate between old and new data during deserialization and dispatch as appropriate in the _from_json_dict_ methods.

  1. What do you mean by parsers? Everything in json is parsed into dicts and lists. We have a mapping from cirq_type: str to Callable[[**Dict], LiterallyAnything]. Where would the version string change the behavior? Would it select different resolvers (the mapping mentioned previously)? Or would we have duplicate dunder methods on the classes: __from_json_dict_v1__(**kwargs), __from_json_dict_v2__(**kwargs)? Or would the version be passed in as a kwarg to each class? In either case: we let 3rd party classes register themselves. How would it select the correct resolver for 3rd party classes?
  2. How are you only going to put one instance of the version spec in the document? That's not really how it works. Each object turns itself into a dictionary without knowing if it's the top object.
  3. How are you going to extract the version spec from the top-level object and forward it to all the nested objects? That's not really how it works either.
  4. What if the top-level object is a list? or an integer?
  5. Do we really expect so much change going from cirq 1.0 to 2.0 that the assumption is everything needs to be deserialized differently? This concern has overlap with some of my other points, but: why does cirq-core's major version affect the serialization and deserialization of objects in our 2nd party packages (cirq_google) and 3rd party packages? it seems far too coarse-grained. Why not have a version for each class?
  6. Speaking of version for each class: why not just change the cirq_type if you have a truly breaking change. {"cirq_type": "FrozenCircuit_v2"}. Or pop a {"version": xx} during serialization of the new object. This would plumb nicely into the existing infrastructure.
  7. What sort of anticipated changes are we defending against? I know binary formats run into places where the original format backs you into a corner, but our data format is json, which doesn't change and is self-documenting. Our input schema is encoded in each object's _from_json_dict_(**kwargs) method, which again seems very flexible. In our experience so far, the problems with json data compatibility have come from Cirq removing classes (so we return, say, a tuple instead of a QidPair during deserialization). I can't imagine a case where you're passed some kwargs and you can't tell whether it's an old schema or a new schema. You'd need to take an existing cirq class; keep its class name; keep all its attribute names; keep all its attribute types; but change the meaning of all of them. If you're doing this, hopefully you'll think to add a disambiguator to your _to_json_ method.
  8. The only place where I can imagine this would be useful is if we want to stop supporting old json documents after a while. We could just look at the version and throw an error rather than dispatching to the callable (which could look at the cirq_type and fields and throw an error ... )
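
To make points 1 and 6 concrete, here is a rough sketch of a cirq_type-to-callable mapping in which a breaking change is expressed as a new cirq_type value. It is a simplified stand-in written with only the standard library, not Cirq's actual resolver code; `RESOLVER`, `resolve`, `loads`, and the two factory functions are hypothetical names.

```python
import json
from typing import Any, Callable, Dict


def make_frozen_circuit_v1(**kwargs: Any) -> Dict[str, Any]:
    # Hypothetical factory for the original schema.
    return {"kind": "FrozenCircuit", "schema": 1, **kwargs}


def make_frozen_circuit_v2(**kwargs: Any) -> Dict[str, Any]:
    # Hypothetical factory for a schema with a breaking change.
    return {"kind": "FrozenCircuit", "schema": 2, **kwargs}


# Mapping from cirq_type string to a callable that builds the object.
RESOLVER: Dict[str, Callable[..., Any]] = {
    "FrozenCircuit": make_frozen_circuit_v1,
    "FrozenCircuit_v2": make_frozen_circuit_v2,
}


def resolve(d: Dict[str, Any]) -> Any:
    # Called by the json module for every decoded object, innermost first,
    # so no object needs to know whether it is the top-level one.
    if "cirq_type" not in d:
        return d
    kwargs = {k: v for k, v in d.items() if k != "cirq_type"}
    cirq_type = d["cirq_type"]
    if cirq_type not in RESOLVER:
        raise ValueError(f"Could not resolve type {cirq_type!r}")
    return RESOLVER[cirq_type](**kwargs)


def loads(text: str) -> Any:
    return json.loads(text, object_hook=resolve)
```

Because dispatch happens per object, this works the same way when the top-level value is a list or a number, and third-party packages could add their own entries to the mapping.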

mpharrigan avatar May 26 '22 01:05 mpharrigan

Thank you for that followup comment. I appreciate very much that someone who knows the details of Cirq thoroughly is willing to take the time to provide detailed and constructive feedback. In response, I will try to address the 8 points and the broader context as best as I can.

Motivations and context

Stepping back and looking at the bigger picture, why did I bring up this issue at all? It seems pointlessly meddlesome, doesn't it?

My past experiences in creating interoperable data formats in other domains (e.g., 1, 2) led to some lessons about evolution of formats. At the request of the user community, we updated the formats over time. Attributes on data entities were sometimes added, removed or restructured, and in a few cases, attributes were not changed but something else about them changed, such as whether they were optional or required. None of the changes were predictable ahead of time. The format evolved over 20 years and is still in use today.

Here is one example of how version numbers were useful for us. Different people wrote desktop software and API libraries in different programming languages, but not always at the same pace as the format evolved. Some software kept up, but others – including some popular tools – did not get updated for years. You couldn't necessarily update the software yourself, nor update an API library on your computer and expect the software to use it. End users sometimes worked with a mix of old and new software, published papers, and uploaded files to online repositories. Not only did users save files in version N and then, years later, want to use the files in new software that supported version N + 1, but also, users sometimes tried to go in the other direction: use newly-created files in older software that was never updated past version N - 1. Older software couldn't hope to handle a newer version of the format – but how could the software tell? A file in a newer version might look like a syntactically invalid file written in a known current version. In the absence of a version number, when reading a file, how could the software distinguish between a file that was (a) valid but using a different version of the format, or (b) invalid because the software that produced it was buggy? Putting a version number in the stored data made it possible for software to know what was expected ("this is supposed to be version N"), and if the software couldn't handle it, inform the user why it couldn't ("sorry, this is an unsupported version of the format") rather than incorrectly claim that the file is invalid ("error on line 237") or just be vague about the reason ("unsupported file format"). An explicit version number reduced uncertainty, and allowed software to give users less confusing feedback.

When I saw JSON files in Cirq like cirq-core/cirq/protocols/json_test_data/Circuit.json lacking version info, it raised alarms. When I came across @Strilanc's comment on issue #4321, it left the impression that there was potential for compatibility issues with future versions of Cirq. It felt like a situation where I could offer a suggestion based on past experiences. I hope I'm not wrong, but if I am, my sincere apologies – I'm trying to make a positive contribution, not to be pedantic or nit-picky.

How can JSON versioning be done at all?

Broadly speaking, information about the version of the format in which some data is represented is really metadata. JSON is an extremely simple data format, and unlike a format like XML, JSON does not inherently offer a way to separate data from metadata. (JSON doesn't even have the concept of a comment!) So, how might we include version information?

The possible approaches boil down to two alternatives:

  • Alternative A: Add an attribute to an existing object (a dict) in the data. An example of doing this can be found in Node NPM package-lock.json files, which use the attribute lockfileVersion. (It was introduced in NPM version 5, and prior to that, no version info was provided in the file.)
  • Alternative B: Don't store the "actual" data at the top level; instead, have an object whose sole purpose is to be a kind of wrapper, and put the real data inside of that. Something along the lines of the following example:
    {
      "version": 1,
      "data": … actual data …
    }
    

Alternative A only works for some kinds of JSON data. It's messy to use if the top-level entity can be a list, and impossible to use if the top-level entity can be a number or string.

Alternative B works for all kinds of data, and offers some additional advantages. It conceptually separates the data from metadata, which is a good separation of concerns. It also offers a clean way to add other information besides version numbers, if desired. On the other hand, it requires a greater change to the serialization approach.
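
A minimal sketch of Alternative B, assuming a wrapper with the two keys shown in the example above. The helper names `wrap` and `unwrap` and the constant `FORMAT_VERSION` are hypothetical.

```python
import json
from typing import Any

FORMAT_VERSION = 1  # assumed current version of the wrapper format


def wrap(payload: Any) -> str:
    # The payload can be any JSON value: dict, list, number, string, or null.
    return json.dumps({"version": FORMAT_VERSION, "data": payload})


def unwrap(text: str) -> Any:
    envelope = json.loads(text)
    if (
        not isinstance(envelope, dict)
        or "version" not in envelope
        or "data" not in envelope
    ):
        raise ValueError("Input is not a versioned wrapper object")
    if envelope["version"] != FORMAT_VERSION:
        raise ValueError(f"Unsupported format version: {envelope['version']!r}")
    return envelope["data"]
```

One caveat of this sketch: an unwrapped legacy dict that happens to contain both "version" and "data" keys would be mistaken for a wrapper, so a real design would need a less ambiguous marker.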

How can versioning be done in Cirq specifically?

I confess that when I wrote this GitHub issue initially, I only briefly looked at Cirq's serialization code and files like Circuit.json. I didn't study the situation carefully enough, and consequently, missed the fact that the JSON content was not always a dict. Now having looked at the serialization code in more detail (particularly json_serialization.py) and the "Serialization guidelines" document, I understand the contents can be anything, without any kind of wrapper/container around them. I now understand what @mpharrigan understood all along, and what led to his points/questions 2-4. Those questions don't have good answers in this case.

With things the way they are in Cirq currently, I think the best option today is what @mpharrigan articulated in his point 6, namely, change the value of cirq_type (e.g., from "foo" to "foo_v2") when a breaking change is made to some object class. This approach has the advantage of also addressing his point 1 and providing a reasonable way for 3rd party classes to handle their versioning.

What are we worried about, anyway?

For completeness, here's an attempt to address points 5 and 7, and comment on point 8.

Point 5: "Do we really expect so much change going from cirq 1.0 to 2.0 that the assumption is everything needs to be deserialized differently? [...] Why not have a version for each class?"

My original proposal explicitly tried to minimize the size and complexity of adding format version information to Cirq JSON files. That goal would have been supported by using a single overall number, but I see now that it can't be done the way I originally thought. Providing version information for each object class would be a more precise alternative, of course, but I think this would start to bloat file sizes and probably annoy developers.

For clarity, let me say that in the case of a single global version, a version change doesn't mean everything has to be deserialized differently. The deserialization code could, e.g., switch to a different handler for parts of the input, based on knowledge of the version. This is pretty common in parsers for other formats. A version number doesn't mean everything has changed, only that something has changed.

Point 7: "What sort of anticipated changes are we defending against?"

I hope Cirq will grow in popularity and be used for many years. This implies its serialization format may evolve over time (as these things often do), and this format may end up being used by other software, including software written by 3rd parties using other programming languages. They may not use the existing Cirq serialization code – perhaps someone at Google decides a faster simulator needs to be written in Go or Scala or whatever. The software may not be updated in a timely fashion, but some may be so popular or unique that people continue using it, even after Cirq and its format have evolved. This would be a sign of success for Cirq! But with different software, of different ages, written with different API libraries, solving compatibility problems is just more complicated. Without a version number, software and humans may have a harder time correctly attributing the causes of some compatibility problems. To put it more simply, an explicit version number helps reduce uncertainty.

Point 8: "[...] if we want to stop supporting old json documents [...] We could just look at the version and throw an error rather than dispatching to the callable"

Yes, this is an important use case supported by having a version number. I think there is value, for the benefit of human software users as well as for debugging problems in software pipelines, to be able to say why something is not supported. "Error in input file" is more confusing and may lead to more issue tickets than "This file uses an unsupported version".
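
To illustrate the difference in feedback, here is a small sketch that reports "unsupported version" separately from "malformed input". It assumes a file-level object carrying a "version" key; the exception classes and `check_versioned_text` are hypothetical names, not Cirq API.

```python
import json
from typing import Any, Dict

SUPPORTED_VERSIONS = (1,)  # assumed set of versions this reader understands


class MalformedFileError(ValueError):
    """The input is not something this reader could ever parse."""


class UnsupportedVersionError(ValueError):
    """The input is well-formed but uses a version this reader lacks."""


def check_versioned_text(text: str) -> Dict[str, Any]:
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        raise MalformedFileError(f"Input is not valid JSON: {err}") from err
    if not isinstance(doc, dict) or "version" not in doc:
        raise MalformedFileError("Input has no version field")
    version = doc["version"]
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedVersionError(
            f"This file uses unsupported format version {version!r}; "
            f"supported versions: {list(SUPPORTED_VERSIONS)}"
        )
    return doc
```

A caller can catch the two exception types separately and tell the user whether to upgrade their software or to suspect a corrupted file.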

Conclusions

The current Cirq JSON serialization scheme and architecture are not well-suited to introducing a version number. An alternative approach to handling versioning with minimum disruption to the current scheme of things is what @mpharrigan wrote in his point 6: use cirq_type to convey version information when a type of object changes in a way that may affect readers and writers.

If other reasons cause Cirq to revise more substantially how its serialized representations are persisted to files, perhaps then it would be a good opportunity to review whether Alternative B described above (i.e., using a wrapper object) would be worthwhile, particularly if there comes a need to store additional information besides a version number. IMHO, this would provide more information to consumers of serialized Cirq representations, thereby reducing uncertainty.

The detailed and technical nature of @mpharrigan's questions afforded a valuable learning opportunity for me personally. I hope there is also value for Cirq's continued progress by having discussed these issues. I'm happy to continue explorations if needed.

mhucka avatar May 31 '22 17:05 mhucka

I think we both agree that designing defensively against the future is a good idea™ and using the cirq_type field is a natural way to demarcate big changes in a particular type. I wanted to add some more color below, however.


In my mind, there is a distinction between the format --- which is JSON; which probably should have had a version identifier so they could introduce comments or metadata --- and the schema, which is very flexible and introspectable. The fact that the deserializer can poke and prod at the keys, their data, and return whatever type it wants means the vast majority of evolutions of our Cirq datatypes can be done without incident. For example:

  1. add a new attribute to a class. It can have a default (no changes required to default deserialization code!) or the deserializer can deduce that if that key-value doesn't exist, we're loading in an old file and can set the attribute to what it would have been in the olden days
  2. rename an attribute: if the deserializer sees the old name, it can forward it to the new name
  3. [hopefully this won't happen because it is a confusing API change for human users, but] rename an attribute, introduce a new attribute with the old name: if both keys exist, we're loading a new document; if only the old name exists, we're loading an old document.

That's why I'm saying it's really hard to imagine a schema change that needs an additional disambiguator. At that point, you should probably be using a different human-facing Cirq type name(!). But in any event, that would call for a cirq_type change.
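
A rough sketch of cases 1 and 2 above, using a made-up class. `ExampleGate` and its attributes `theta`, `angle`, and `label` are hypothetical; only the `_from_json_dict_(**kwargs)` hook name is taken from the discussion.

```python
from typing import Any, Optional


class ExampleGate:
    """Hypothetical class used only for illustration; not a real Cirq type."""

    def __init__(self, angle: float, label: Optional[str] = None):
        self.angle = angle
        self.label = label

    @classmethod
    def _from_json_dict_(cls, **kwargs: Any) -> "ExampleGate":
        # Case 2: the attribute was renamed from 'theta' to 'angle'.
        # If only the old name is present, forward it to the new name.
        if "theta" in kwargs and "angle" not in kwargs:
            kwargs["angle"] = kwargs.pop("theta")
        # Case 1: 'label' was added later. Old documents lack the key,
        # so fall back to the value it would have had in the old days.
        kwargs.setdefault("label", None)
        return cls(angle=kwargs["angle"], label=kwargs["label"])
```

Neither case needs a version number: the presence or absence of keys is enough to tell an old document from a new one.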


I don't consider Cirq JSON to be an interchange format. This is due to the design motivations:

  1. Really easy to dump Cirq objects as faithfully as possible. "Easy" means no- or low-code local changes
  2. Backwards compatible loading.
  3. (optional) somewhat self-describing so if some intrepid 3rd party developer wanted to load a subset of cirq json without importing cirq, they would have a fighting chance.

(1) means json or pickle since we're coming from python; (2) means json. Then we get (3).

If you want an interchange format for experimental data, circuits, or bitstrings, you should probably use something else (pandas.DataFrame.to_hdf, qasm, numpy.save, ...)

The json schema is inextricably linked to the Cirq API, particularly its public objects. If a 3rd party developer wanted to be able to load in every type that could be in a cirq document, they would be re-implementing all of cirq.


"users sometimes tried to go in the other direction: use newly-created files in older software that was never updated past version N - 1"

This is not supported and would likely be a huge headache to guarantee. We have a half-decent unit test setup for checking backwards compatible loading, but it would be a lot of developer overhead to write backwards compatible saving; users would have to "opt in" to saving old schema versions, which they likely wouldn't; and you'd lose information (any new attributes or types added). Cirq is still adding and changing a lot (usually in a backwards-compatible but not forwards-compatible way). We sort of need this flexibility to write a quantum computing SDK, which is a quickly changing field.

In terms of error messages: I suppose it would be nicer for it to say "you're trying to load in something new! update your cirq". But with the current scheme, 1) it would still parse and 2) you'd get a reasonably helpful error message like "couldn't find FrozenCircuit"; then you google frozen circuit and see that it was added in cirq 0.12 or whatever.

Nicer error messages would have to be balanced with the non-negligible overhead of now having a lot of cirq_type keys floating around that may or may not point to the same constructor.

mpharrigan avatar May 31 '22 18:05 mpharrigan

Thanks for those great comments! Some small follow-ups:

  1. I didn't make it clear in the (already too long) response earlier, but I do realize Cirq's JSON format isn't meant to be a true interchange format. Sometimes, though, internal formats get used by other software and slowly become de facto standards, so I thought there might still be value in thinking about it from that bigger perspective.

  2. Regarding the "other" direction issue (old software, new format): sometimes it's hard to prevent, because one can't always tell people what to do. OTOH, it's admittedly more of an issue for software-independent exchange formats and (hopefully) less for Cirq's situation.

Finally, I'm personally a little bit leery of hoping that people can google error messages and figure out the causes. I know you probably don't mean that as a general policy :-), but, while current devs probably can do that, I think it gets harder for outside devs, new devs, and nonexpert users (e.g., students in a quantum computing class). All that said, though, it's really hard to solve everything all at once when a field and software and infrastructure are all changing rapidly ...

Thanks again.

mhucka avatar May 31 '22 22:05 mhucka

Let's bring this up again at the Cirq Synch.

vtomole avatar Jun 01 '22 16:06 vtomole

Last week's Cirq Cynq included more discussion, but still without conclusions, and it prompted some at the time to characterize it as bikeshedding. I fear this whole issue has become an unintended distraction. I've been trying to think of the most efficient way to move forward, and want to suggest the following:

  1. Either (a) retag this issue from time/before-1.0 to time/after-1.0, or (b) close this issue. Rationale: as discussed above, what's being saved to files is plain JSON, which means that there is no uncomplicated way to do what was proposed originally. On balance, the ROI seems too low.
  2. (Possibly) open a separate issue to explore (post Cirq 1.0) what is written to files (and only files).
  3. (Possibly) in the "serialization guidelines" document, mention the idea of changing the cirq_type value if an object is substantially altered, per the discussion above.

Elaboration on point 2: my probably-naive thinking is that there might be some simple ways to accomplish this, such as enhancing the serialization methods to behave differently when the destination or source is known to be a file. The behavior then could perhaps be to write a dict that wraps the actual data and carries version and other informational attributes. (Maybe we can draw inspiration from what TensorFlow writes out?) If this is ever deemed worthwhile by other people here, I would offer to put my money where my mouth is and do some exploratory proof-of-concept implementation.

mhucka avatar Jun 15 '22 17:06 mhucka

Per today's Cirq Cync, @dabacon suggested the following concrete steps:

  1. retag this as time/after-1.0
  2. change the title of this issue to reflect the goal being to revisit what is written to files (instead of opening a different issue)

mhucka avatar Jun 15 '22 18:06 mhucka

This is a very useful discussion! I also thought more about the issue and I agree with most of what Matt has said. Here are my thoughts:

One of the most important points to focus on when thinking about this issue is that the JSON serialization used by Cirq is NOT an exchange format and is expected to be used primarily by Cirq to load the data and circuits generated by (potentially older versions of) Cirq. This is the primary reason why:

  • We have the flexibility of inspecting the contents of the serialized objects and dispatching to the logic of constructing new equivalent objects as part of deserialization. This flexibility would not exist if the serialized schema was supposed to be an exchange format, since other clients reading only the serialized data would also need to replicate the dispatch logic that Cirq has implemented, which is not scalable.
  • This is also the primary reason why we can guarantee that we would be able to support deserialization of cirq objects forever, which is very hard (impossible?) to guarantee if the serialization format is supposed to be an exchange format.

As an example, let's consider what would happen if we used protobufs instead of JSON for serializing Cirq objects. Suppose we have an XPowGate that accepts theta_in_radians as the only argument and we want to instead replace this argument by theta_in_degrees. In the current json serialization framework, we can just update the class to use the new argument (via a deprecation cycle, of course) and then the deserialization logic can inspect if the json dict has a key theta_in_radians or theta_in_degrees (by design, only one of the two keys can exist at any point) and appropriately convert the radians to degrees and dispatch to the new logic if needed. But what would happen if we were using protobufs?

// Original protobuf message.
message XPowGate {
    float theta_in_radians = 1;
}
// Protobuf message during deprecation. 
message XPowGate {
    float theta_in_radians = 1 [deprecated = true];
    float theta_in_degrees = 2;
}
// Protobuf message after deprecation cycle.
message XPowGate {
    reserved 1;
    float theta_in_degrees = 2;
}

Notice how, to maintain backwards compatibility, we would end up needing version numbers and/or scripts to facilitate conversion from one version of the proto message to the next. This is because an arbitrary software that reads just the serialized data needs to know the schema that corresponds to the data. And as the schema updates over time, maintaining a version number allows us to create a mapping between serialized data and the schema that was used when the data was serialized.

In our case, the schema doesn't need to persist over time, since we are assuming that Cirq is going to be the only consumer of these serialized files and evolution of the schema over time is implicitly captured by the custom json deserialization logic that we have implemented.
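
For comparison with the protobuf messages, the JSON-side handling of the same hypothetical change might look roughly like this. `XPowGateSketch` is a stand-in for the illustrative gate in this example, not the real cirq.XPowGate, and the attribute names follow the example rather than any actual Cirq API.

```python
import math
from typing import Any, Dict


class XPowGateSketch:
    """Stand-in for the example gate after the switch to degrees."""

    def __init__(self, theta_in_degrees: float):
        self.theta_in_degrees = theta_in_degrees

    def _json_dict_(self) -> Dict[str, Any]:
        # New documents only ever contain the new key.
        return {"theta_in_degrees": self.theta_in_degrees}

    @classmethod
    def _from_json_dict_(cls, **kwargs: Any) -> "XPowGateSketch":
        # By design only one of the two keys is present in any document,
        # so the key itself identifies which schema was used to write it.
        if "theta_in_radians" in kwargs:
            return cls(theta_in_degrees=math.degrees(kwargs["theta_in_radians"]))
        return cls(theta_in_degrees=kwargs["theta_in_degrees"])
```

The schema history lives in the deserialization code rather than in a version number stored with the data.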

Hope this helps to add some clarity to the situation. Unless there are further concerns, I would vote that we can close this issue without needing to add any version numbering to our json serialization logic.

tanujkhattar avatar Sep 20 '22 00:09 tanujkhattar

Thank you @tanujkhattar for the thoughtful reply and further elaboration!

If things are left as-is (i.e., no changes to the json format), then for the sake of future developers, I suggest it would be worth adding some text to the "serialization guidelines" document to bring up some of what @mpharrigan wrote in his comment, about changing the cirq_type value if an object is substantially altered.

mhucka avatar Sep 28 '22 16:09 mhucka