Mutability of documents returned by databroker

Open danielballan opened this issue 5 years ago • 0 comments

Documenting a train conversation with @tacaswell where we try to recall and revisit the ~2014 decision to make the documents returned by databroker immutable at the top level only. attn @gwbischof

The primary concerns, with the most important first, are:

What if users mutate an object that is also stored in an internal databroker cache? The object would be permanently changed for the lifetime of the process, and it would be impossible to re-request the original (except by clearing internal caches).
What if users make uncoordinated changes that result in a invalid document stream, such as adding a key to an Event but neglecting to add it to the Event Descriptor?
What if one part of user code mutates a document and that change has unexpected consequences for another part of user code? This was a more serious concern in 2014, when event['descriptor'] contained a reference to the actual Event Descriptor dict, shared by all Events in the stream, rather than the Event Descriptor's uid. One change could propagate out.

Our options are: (1) Return plain dicts, deep-copying any cached documents on the way out to avoid cache corruption. (2) Return dicts with top-level keys immutable, deep-copying as in (1). (3) Return dicts that are recursively immutable, made up of frozen dicts instead of dicts, tuples instead of lists, and read-only numpy arrays. No need for deep-copying here because nothing is mutable.

All of the options address the first, most important concern with an engineering control. They make it technically impossible for the user to corrupt the data that is internally cached. The others take different approaches to addressing the two remaining concerns. Option (1) relies on process control, leaving it up to the user to understand the risks of mutation and handle any consequences if mutation is used. Option (3) applies an engineering control. Option (2) is a trade-off, applying a part-way engineering control that covers some common cases and may remind users of the process control.

We already need special class in order to get the tokenization speedups in #489. Given that, we cannot return ordinary dicts, so we might as well offer the part-way engineer control of (2). In our view, it is worth investigating what the costs of moving all the way to (3) would be, but it is not a priority. The includes the performance cost of recursively building up a document out of immutable types and the API complexity costs of handling the user this object. For example, with (2), users can do dict(doc) to get a fully-mutable object that they can, say, apply revisions to. With (3), a special conversion function like make_mutable(doc) would be needed.

Action items:

[x] Make the new Document type top-level immutable, just as doct.Document is for v0. #491
[ ] Reinstate just enough deep copying to protect against cache corruption. Add a flag to let advanced consumers turn off copying for performance. Some of this may have to happen at the driver layer. #496
[ ] In the longer term, investigate costs of Option (3).

Jan 14 '20 14:01 danielballan