sphinx icon indicating copy to clipboard operation
sphinx copied to clipboard

Move away from `pickle`-based serialisation

Open AA-Turner opened this issue 5 months ago • 5 comments

Is your feature request related to a problem? Please describe.

Currently, Sphinx uses Python's pickle module for various serialisation-related functions:

  • Saving and restoring the build environment (app.env) for incremental rebuilds
  • Saving and restoring the Docutils AST ('doctree') of individual documents between the reading and writing phases
  • Reconcilling changes in the build environment between worker processes and the main process, when using parallel processing
  • Saving coverage data in sphinx.ext.coverage

The pickle module is convenient, but also insecure (unpickling can execute arbitrary Python code) and hard to introspect. I would like to replace our use of pickle with data-only serialisation.

Challenges include, but are not limited to:

  • There is no round-trip serialisation format for Docutils ASTs.
  • Extensions currently store arbitrary data directly on app.env, which includes arbitrary (non-builtin) types.

For non-pickle serialisation to be supported by extensions, I expect we will need to provide (a) hooks for serialising and deserialising and (b) an 'official' method of saving data to the environment.

In terms of implementation, we progressively convert attributes or data to be (de)serialised via a non-pickle mechanism.

Describe the solution you'd like

To eventually deprecate and remove all use of the pickle module.

Describe alternatives you've considered

The status quo.

A

cc @sphinx-doc/developers @sphinx-doc/triagers for input.

AA-Turner avatar Jun 03 '25 18:06 AA-Turner

Is pickle really that bad for the use case?

insecure (unpickling can execute arbitrary Python code)

Does this really increase the attack surface? For example, as soon as you have write access to conf.py Sphinx will execute arbitrary code. The pickle files are only created and used locally. They are not distributed with the project.

hard to introspect

What exactly is the issue here? Is it that debugging during development or checking that an existing pickle is still compatible with the state of the project.

timhoffm avatar Jun 03 '25 21:06 timhoffm

I would phrase it more as that the attack surface is overlapped -- we have a goal to remove conf.py. You're right though that it is not a pressing security concern, given that conf.py does exist.

Re: introspection, I recently debugged an issue where an environment pickle file was ~1GB, when it should have been on the order of single-digit megabytes. Even loading the pickle file required replicating sys.path changes and importing local extensions expected to be in scope. Using a pure-data format would mean that such debugging could use standard non-Python tools.

It would also eliminate a class of bugs we currently face with inconsistent objects between the pickle and runtime, though this is technically user error as the extension version in setup() should be increased.

A

AA-Turner avatar Jun 04 '25 17:06 AA-Turner

A good first step would be to formalize the storage of extension data. Dynamically adding attributes to app.env at runtime is quite wild. Requirements are:

  • extension data should be grouped by extension and separated from core sphinx data, e.g. something like app.env.extension_data["myextension"]

  • The extension data should be versioned, so that we can match loaded data to the data that the installed extension expects.

  • The data should (ideally) not contain classes that are defined in the extension, because that would fail unpickling if the extension is not available. In particular, the data structure that collects the values per extension (i.e. the result of app.env.extension_data["myextension"]) should not be defined in the extension.

    In the simplest form, this could be a dict. But it may be better that we provide a class such as

    class ExtensionData:
        def __init__(self, version, **kwargs):
             self.version = version
             self.__dict__.update(kwargs)
    

    and extensions would create ExtensionData("1.0", foo="name", bar=2) and use

    app.env.extension_data["myextension"].foo
    

    The kwargs are intentionally left dynamic so that the same ExtensionData class can be used across all extensions and there is no unplickling problem. Optionally, once/if pickling is removed, such ExtensionData classes could be replaced by a dataclass defined in the extension, which would support better typing/completion. The charm of using ExtensionData now is that usage of ExtensionData and a potenial future data class are idential, i.e. both are accessed by app.env.extension_data["myextension"].foo, which simplifies a possible transition.

I could draft a PR if there is interest.

timhoffm avatar Jun 04 '25 22:06 timhoffm

* There is no round-trip serialisation format for Docutils ASTs.

Docutils 0.22 provides round-trip serialisation with the "docutils-xml" writer and parser.

gmilde avatar Jun 11 '25 19:06 gmilde

The case for a non-pickle-based format also seems unclear to me. Granted that the existing pickle-based approach easily leads to waste (unnecessary data being serialized), but that could easily happen with another format also, especially if it supports cyclical object references). This will be a lot of churn and additional complexity for extension authors and the actual benefit is unclear.

jbms avatar Jun 12 '25 05:06 jbms