srsly icon indicating copy to clipboard operation
srsly copied to clipboard

Unvendor all libraries

Open crusaderky opened this issue 1 month ago • 7 comments

  • Closes #116
  • Closes #83
  • Closes #89
  • Supersedes #110
  • Supersedes #111
  • Supersedes #117

High level changes

In an effort to reduce the maintainance burden, unvendor all libraries except msgpack-python and upgrade them to the latest available version, with sane but reasonable version pins.

This causes the previously vendored libraries to be bumped up several years all at once. For this reason, I'd rather not merge this PR until we gather some confidence that it doesn't introduce regressions down the line.

This PR causes srsly to become a much simpler pure-python package.

Breaking changes

This PR removes srsly.msgpack, srsly.cloudpickle, srsly.ruamel_yaml, srsly.ujson. This could cause breakages downstream. The trivial fix is to simply use msgpack, cloudpickle, ruamel.yaml, and ujson directly.

Free-threading (noGIL)

This PR does not, as of today, make srsly compatible with free-threading, but it will automatically do so as soon as the below issues will be fixed upstream and new upstream releases become available:

  • https://github.com/msgpack/msgpack-python/issues/613
  • https://github.com/ultrajson/ultrajson/issues/684

Other changes

  • Bump version to 3.0, in an effort to avoid accidental bumps in downstream projects.
  • Fix ujson segfault in Python 3.14 (see #117)
  • Introduce a minor change in json output, where dicts gain a whitespace between key and value (from {'x':1} to {'x': 1}. This should be purely cosmetic but may cause some downstream unit tests to trivially fail.
  • Add CI coverage for the use case where numpy is not installed

msgpack-specific changes

msgpack-python could not be unvendored. The reasons are that the latest upstream version of it

  • is quite old and uses a hacky patch system that offers no guarantee to remain feasible vs. future versions of msgpack, which is instead actively developed
  • introduces a regression where it fails to round-trip np.float64, due to it being a subclass of float
  • introduces a hard dependency to numpy

So instead I heavily reworked the fork that we have and added extra tests.

I also wrote from scratch a system for third-party msgpack extensions that is compatible with the previously vendored system, which is no longer available upstream.

Additionally:

  • Fix bug where built-in complex could not be serialized by msgpack unless numpy is installed
  • Fix bug where gc would remain permanently disabled if msgpack decode fails for any reason, e.g. in case of corrupted stream or if a third party extension raises.
  • Change exception when trying to decode a numpy object without having numpy installed from a nebulous AttributeError to a clearer ModuleNotFoundError

crusaderky avatar Nov 19 '25 08:11 crusaderky

FWIW, I mentioned having concerns about the library that led to this decision to vendor it and remove these unsafe branches. I found the message where I explained this to our team at the time:

So YAML has this infamous vulnerability where if you are so foolish as to write yaml.load as opposed to yaml.safe_load, you allow arbitrary code execution, as the YAML message may contain pickled Python objects. So consider these lines: https://github.com/pycontribs/ruamel-yaml/blob/master/main.py#L1392 if issubclass(Loader, BaseLoader): BaseConstructor.add_constructor(tag, object_constructor) elif issubclass(Loader, SafeLoader): SafeConstructor.add_constructor(tag, object_constructor)

Looks...alarming, right? Surely the order of conditions is wrong, so it will never enter SafeLoader :scream: Ah but wait...silly me. Why would I think SafeLoader is a subclass of BaseLoader? In fact SafeLoader has 99 base classes but BaseLoader ain't one:

class SafeLoader(Reader, Scanner, Parser, Composer, SafeConstructor, VersionedResolver):

This wasn't originally written as a public message so it's more flippant than I'd generally be --- no code is perfect, and this relates to stylistic concerns that are matters of judgment. But my concern was that future versions of the library could easily introduce a regression that affected the safe loading feature.

As I said on the call I can agree to unvendor the code as the cost/benefit analysis on the maintenance burden is different now. But I definitely want to make sure we maintain the behaviours that we had before, of forcing the library to only work in safe mode. A test that checked that the code execution features are in fact disabled would also be appreciated.

honnibal avatar Nov 20 '25 12:11 honnibal

Not sure I understand your worry here. It seems to me that to get a code injection one needs to

  1. deliberately bypass the srlsy API and directly use either ruamel.yaml or the backards compatibility alias srsly.yaml, and
  2. explicitly opt in with typ='unsafe'.

I too remember the time where yaml.load was unsafe by default, but that was many years ago.

>>> import io
>>> import srsly
>>> from ruamel.yaml import YAML

>>> class C:
...     def __new__(cls):
...         print("Arbitrary code execution!")
...         return object.__new__(cls)

>>> c = C()
Arbitrary code execution!

>>> srsly.yaml_dumps(c)
RepresenterError: cannot represent an object: <__main__.C object at 0x724fd4819e10>

>>> buf = io.StringIO()
>>> yaml = YAML(typ='unsafe')
>>> yaml.dump(c, buf)
>>> buf.getvalue()
'!!python/object:__main__.C {}\n'

>>> buf.seek(0)
>>> yaml.load(buf)
Arbitrary code execution!
<__main__.C at 0x724f75c75050>

>>> yaml2 = YAML()
>>> buf.seek(0)
>>> yaml2.load(buf)
{}

>>> srsly.yaml_loads(buf.getvalue())

ValueError: Invalid YAML: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object:__main__.C'
  in "<unicode string>", line 1, column 1:
    !!python/object:__main__.C {}
    ^ (line: 1)

I've added a unit test to verify this behaviour.

crusaderky avatar Nov 20 '25 12:11 crusaderky

Just because I'm not familiar with the package: are the tests all inherited from upstream or did you do the edits to the tests manually?

All surviving tests are unique to srsly. All changes in them are my own edits. I've deleted all tests that were inherited from upstream.

crusaderky avatar Nov 20 '25 12:11 crusaderky

@ngoldbaum I've reinstated a bunch of tests where feasible.

crusaderky avatar Nov 21 '25 12:11 crusaderky

As discussed offline with @ngoldbaum , I've removed the legacy subpackages srsly.msgpack, srsly.cloudpickle, srsly.ruamel_yaml, and srsly.ujson. This is a breaking change.

crusaderky avatar Nov 24 '25 17:11 crusaderky

I tested this with spaCy built from-source using all of the latest versions of our WIP work on the full stack with Python 3.13. See this gist for my full environment from pip freeze.

All the tests pass: https://gist.github.com/ngoldbaum/8878de662dcf55b797613145b07ecea6

ngoldbaum avatar Dec 05 '25 16:12 ngoldbaum

@honnibal does the spaCy test suite passing with this PR and the discussion above assuage your concerns?

ngoldbaum avatar Dec 05 '25 16:12 ngoldbaum