Unvendor all libraries
- Closes #116
- Closes #83
- Closes #89
- Supersedes #110
- Supersedes #111
- Supersedes #117
High level changes
In an effort to reduce the maintainance burden, unvendor all libraries except msgpack-python and upgrade them to the latest available version, with sane but reasonable version pins.
This causes the previously vendored libraries to be bumped up several years all at once. For this reason, I'd rather not merge this PR until we gather some confidence that it doesn't introduce regressions down the line.
This PR causes srsly to become a much simpler pure-python package.
Breaking changes
This PR removes srsly.msgpack, srsly.cloudpickle, srsly.ruamel_yaml, srsly.ujson. This could cause breakages downstream. The trivial fix is to simply use msgpack, cloudpickle, ruamel.yaml, and ujson directly.
Free-threading (noGIL)
This PR does not, as of today, make srsly compatible with free-threading, but it will automatically do so as soon as the below issues will be fixed upstream and new upstream releases become available:
- https://github.com/msgpack/msgpack-python/issues/613
- https://github.com/ultrajson/ultrajson/issues/684
Other changes
- Bump version to 3.0, in an effort to avoid accidental bumps in downstream projects.
- Fix ujson segfault in Python 3.14 (see #117)
- Introduce a minor change in json output, where dicts gain a whitespace between key and value (from
{'x':1}to{'x': 1}. This should be purely cosmetic but may cause some downstream unit tests to trivially fail. - Add CI coverage for the use case where numpy is not installed
msgpack-specific changes
msgpack-python could not be unvendored. The reasons are that the latest upstream version of it
- is quite old and uses a hacky patch system that offers no guarantee to remain feasible vs. future versions of msgpack, which is instead actively developed
- introduces a regression where it fails to round-trip
np.float64, due to it being a subclass offloat - introduces a hard dependency to
numpy
So instead I heavily reworked the fork that we have and added extra tests.
I also wrote from scratch a system for third-party msgpack extensions that is compatible with the previously vendored system, which is no longer available upstream.
Additionally:
- Fix bug where built-in
complexcould not be serialized by msgpack unlessnumpyis installed - Fix bug where gc would remain permanently disabled if msgpack decode fails for any reason, e.g. in case of corrupted stream or if a third party extension raises.
- Change exception when trying to decode a numpy object without having numpy installed from a nebulous AttributeError to a clearer ModuleNotFoundError
FWIW, I mentioned having concerns about the library that led to this decision to vendor it and remove these unsafe branches. I found the message where I explained this to our team at the time:
So YAML has this infamous vulnerability where if you are so foolish as to write yaml.load as opposed to yaml.safe_load, you allow arbitrary code execution, as the YAML message may contain pickled Python objects. So consider these lines: https://github.com/pycontribs/ruamel-yaml/blob/master/main.py#L1392
if issubclass(Loader, BaseLoader):BaseConstructor.add_constructor(tag, object_constructor)elif issubclass(Loader, SafeLoader):SafeConstructor.add_constructor(tag, object_constructor)Looks...alarming, right? Surely the order of conditions is wrong, so it will never enter SafeLoader :scream: Ah but wait...silly me. Why would I think
SafeLoaderis a subclass ofBaseLoader? In factSafeLoaderhas 99 base classes butBaseLoaderain't one:
class SafeLoader(Reader, Scanner, Parser, Composer, SafeConstructor, VersionedResolver):
This wasn't originally written as a public message so it's more flippant than I'd generally be --- no code is perfect, and this relates to stylistic concerns that are matters of judgment. But my concern was that future versions of the library could easily introduce a regression that affected the safe loading feature.
As I said on the call I can agree to unvendor the code as the cost/benefit analysis on the maintenance burden is different now. But I definitely want to make sure we maintain the behaviours that we had before, of forcing the library to only work in safe mode. A test that checked that the code execution features are in fact disabled would also be appreciated.
Not sure I understand your worry here. It seems to me that to get a code injection one needs to
- deliberately bypass the
srlsyAPI and directly use eitherruamel.yamlor the backards compatibility aliassrsly.yaml, and - explicitly opt in with
typ='unsafe'.
I too remember the time where yaml.load was unsafe by default, but that was many years ago.
>>> import io
>>> import srsly
>>> from ruamel.yaml import YAML
>>> class C:
... def __new__(cls):
... print("Arbitrary code execution!")
... return object.__new__(cls)
>>> c = C()
Arbitrary code execution!
>>> srsly.yaml_dumps(c)
RepresenterError: cannot represent an object: <__main__.C object at 0x724fd4819e10>
>>> buf = io.StringIO()
>>> yaml = YAML(typ='unsafe')
>>> yaml.dump(c, buf)
>>> buf.getvalue()
'!!python/object:__main__.C {}\n'
>>> buf.seek(0)
>>> yaml.load(buf)
Arbitrary code execution!
<__main__.C at 0x724f75c75050>
>>> yaml2 = YAML()
>>> buf.seek(0)
>>> yaml2.load(buf)
{}
>>> srsly.yaml_loads(buf.getvalue())
ValueError: Invalid YAML: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object:__main__.C'
in "<unicode string>", line 1, column 1:
!!python/object:__main__.C {}
^ (line: 1)
I've added a unit test to verify this behaviour.
Just because I'm not familiar with the package: are the tests all inherited from upstream or did you do the edits to the tests manually?
All surviving tests are unique to srsly. All changes in them are my own edits. I've deleted all tests that were inherited from upstream.
@ngoldbaum I've reinstated a bunch of tests where feasible.
As discussed offline with @ngoldbaum , I've removed the legacy subpackages srsly.msgpack, srsly.cloudpickle, srsly.ruamel_yaml, and srsly.ujson. This is a breaking change.
I tested this with spaCy built from-source using all of the latest versions of our WIP work on the full stack with Python 3.13. See this gist for my full environment from pip freeze.
All the tests pass: https://gist.github.com/ngoldbaum/8878de662dcf55b797613145b07ecea6
@honnibal does the spaCy test suite passing with this PR and the discussion above assuage your concerns?