sphinx
sphinx copied to clipboard
Move away from `pickle`-based serialisation
Is your feature request related to a problem? Please describe.
Currently, Sphinx uses Python's pickle module for various serialisation-related functions:
- Saving and restoring the build environment (
app.env) for incremental rebuilds - Saving and restoring the Docutils AST ('doctree') of individual documents between the reading and writing phases
- Reconcilling changes in the build environment between worker processes and the main process, when using parallel processing
- Saving coverage data in
sphinx.ext.coverage
The pickle module is convenient, but also insecure (unpickling can execute arbitrary Python code) and hard to introspect. I would like to replace our use of pickle with data-only serialisation.
Challenges include, but are not limited to:
- There is no round-trip serialisation format for Docutils ASTs.
- Extensions currently store arbitrary data directly on
app.env, which includes arbitrary (non-builtin) types.
For non-pickle serialisation to be supported by extensions, I expect we will need to provide (a) hooks for serialising and deserialising and (b) an 'official' method of saving data to the environment.
In terms of implementation, we progressively convert attributes or data to be (de)serialised via a non-pickle mechanism.
Describe the solution you'd like
To eventually deprecate and remove all use of the pickle module.
Describe alternatives you've considered
The status quo.
A
cc @sphinx-doc/developers @sphinx-doc/triagers for input.
Is pickle really that bad for the use case?
insecure (unpickling can execute arbitrary Python code)
Does this really increase the attack surface? For example, as soon as you have write access to conf.py Sphinx will execute arbitrary code. The pickle files are only created and used locally. They are not distributed with the project.
hard to introspect
What exactly is the issue here? Is it that debugging during development or checking that an existing pickle is still compatible with the state of the project.
I would phrase it more as that the attack surface is overlapped -- we have a goal to remove conf.py. You're right though that it is not a pressing security concern, given that conf.py does exist.
Re: introspection, I recently debugged an issue where an environment pickle file was ~1GB, when it should have been on the order of single-digit megabytes. Even loading the pickle file required replicating sys.path changes and importing local extensions expected to be in scope. Using a pure-data format would mean that such debugging could use standard non-Python tools.
It would also eliminate a class of bugs we currently face with inconsistent objects between the pickle and runtime, though this is technically user error as the extension version in setup() should be increased.
A
A good first step would be to formalize the storage of extension data. Dynamically adding attributes to app.env at runtime is quite wild. Requirements are:
-
extension data should be grouped by extension and separated from core sphinx data, e.g. something like
app.env.extension_data["myextension"] -
The extension data should be versioned, so that we can match loaded data to the data that the installed extension expects.
-
The data should (ideally) not contain classes that are defined in the extension, because that would fail unpickling if the extension is not available. In particular, the data structure that collects the values per extension (i.e. the result of
app.env.extension_data["myextension"]) should not be defined in the extension.In the simplest form, this could be a dict. But it may be better that we provide a class such as
class ExtensionData: def __init__(self, version, **kwargs): self.version = version self.__dict__.update(kwargs)and extensions would create
ExtensionData("1.0", foo="name", bar=2)and useapp.env.extension_data["myextension"].fooThe kwargs are intentionally left dynamic so that the same
ExtensionDataclass can be used across all extensions and there is no unplickling problem. Optionally, once/if pickling is removed, such ExtensionData classes could be replaced by a dataclass defined in the extension, which would support better typing/completion. The charm of usingExtensionDatanow is that usage ofExtensionDataand a potenial future data class are idential, i.e. both are accessed byapp.env.extension_data["myextension"].foo, which simplifies a possible transition.
I could draft a PR if there is interest.
* There is no round-trip serialisation format for Docutils ASTs.
Docutils 0.22 provides round-trip serialisation with the "docutils-xml" writer and parser.
The case for a non-pickle-based format also seems unclear to me. Granted that the existing pickle-based approach easily leads to waste (unnecessary data being serialized), but that could easily happen with another format also, especially if it supports cyclical object references). This will be a lot of churn and additional complexity for extension authors and the actual benefit is unclear.