pysimdjson
v7.0.0
This is a major breaking-change release that removes the Array and Object proxies. However, after checking all GitHub repos with more than 5 stars that depend on this one, only 2 were using these features. The proxies were generally an anti-pattern: if you needed 1 value, use at_pointer() instead. If you needed more than 1 value, it was almost always faster to use at_pointer() for an entire object at once. This new approach also alleviates memory management issues on PyPy.
If all you used was simdjson.loads() and simdjson.parse(), you should notice no difference.
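As a rough illustration of the lookup that at_pointer() performs, here is a minimal standard-library sketch of JSON Pointer (RFC 6901) resolution. The `resolve_pointer` helper is hypothetical and is not part of pysimdjson; it walks objects that `json.loads` has already built, whereas the library resolves the pointer inside the parsed document.

```python
import json


def resolve_pointer(doc, pointer):
    """Follow an RFC 6901 JSON Pointer through already-parsed JSON."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = doc
    for raw in pointer[1:].split("/"):
        # RFC 6901 unescaping order: "~1" becomes "/", then "~0" becomes "~".
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # array elements are addressed by index
        else:
            current = current[token]  # object members are addressed by key
    return current


doc = json.loads('{"a": {"b": [10, 20, {"c/d": "x"}]}}')

one_value = resolve_pointer(doc, "/a/b/1")     # a single scalar
subtree = resolve_pointer(doc, "/a/b")         # a whole array in one call
escaped = resolve_pointer(doc, "/a/b/2/c~1d")  # "~1" stands for "/" in a key
```

One pointer lookup per value, or one lookup for the enclosing object, replaces the chain of `doc['a']['b']` proxy accesses.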
- Drop Python 3.6 and 3.7, which are now beyond end-of-life. Add Python 3.11.
- Exploits CPython Unicode object internals for significantly faster string creation (up to 45%!)
- Removed Array and Object proxy objects.
- Changing our approach to this has significantly improved memory safety internally and fixed PyPy support.
- Update deprecated GitHub Actions.
- Update vendored simdjson to version 3.2.3.
- Minified floats no longer drop the .0 (see #102)
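Issue #102 is not quoted here, so the following is only a sketch of why the trailing .0 matters, using the standard-library `json` module rather than pysimdjson. The key names are made up for the example.

```python
import json

# Compact separators produce minified output; the float keeps its ".0".
minified = json.dumps({"price": 1.0, "count": 1}, separators=(",", ":"))

# Reading the minified text back preserves the float type.
round_tripped = json.loads(minified)

# If a minifier dropped the ".0", the value would come back as an int.
dropped = json.loads('{"price":1,"count":1}')

price_type_kept = type(round_tripped["price"]).__name__
price_type_dropped = type(dropped["price"]).__name__
```

Without the .0, a consumer that distinguishes integers from floats sees a different type after a minify and re-parse round trip.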
ToDo:
- [ ] Re-add JSON-to-buffer/numpy array removed in initial cleanup (this method is many times faster than naively loading JSON when turning a homogeneous array of JSON values into a numpy array)
- [ ] Add support for latest PyPy
- [ ] Memory optimization pass
- [ ] Update documentation and examples.
For certain benchmarks, especially those that are string-heavy, this version is now roughly 45% faster.
--- benchmark 'Complete load of data/twitter.json': 2 tests ---
Name (time in us)  Min                Max                Mean               StdDev           Median             IQR             Outliers  OPS              Rounds  Iterations
simdjson (NOW)     916.4979 (1.0)     4,188.6930 (1.0)   1,011.2369 (1.0)   391.3896 (1.0)   939.6220 (1.0)     24.2511 (1.0)   16;88     988.8880 (1.0)   690     1
simdjson (OLD)     1,328.0310 (1.45)  4,533.0260 (1.08)  1,428.9499 (1.41)  414.6389 (1.06)  1,355.7710 (1.44)  31.2135 (1.29)  12;49     699.8146 (0.71)  507     1
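The parenthesised ratios in the benchmark can be reproduced from the raw figures. This is a small arithmetic check using only numbers copied from the two rows; the variable names are illustrative.

```python
# Timings are in microseconds; OPS is operations per second.
now = {"min": 916.4979, "mean": 1011.2369, "median": 939.6220, "ops": 988.8880}
old = {"min": 1328.0310, "mean": 1428.9499, "median": 1355.7710, "ops": 699.8146}

# How many times longer the old version takes, per statistic.
min_ratio = old["min"] / now["min"]
mean_ratio = old["mean"] / now["mean"]
median_ratio = old["median"] / now["median"]

# Throughput of the old version relative to the new one.
ops_ratio = old["ops"] / now["ops"]

print(f"min {min_ratio:.2f}x, mean {mean_ratio:.2f}x, "
      f"median {median_ratio:.2f}x, ops {ops_ratio:.2f}x")
```

The "roughly 45%" figure matches the Min and Median columns; on the Mean column the difference is closer to 41%.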
I am using pysimdjson to do work like this:
doc = Parser().parse('{"a": {"b": ...}}')
b = doc['a']['b']
s = b.mini
The contents under b are huge, and pysimdjson allows me to avoid creating Python objects of them.
With the new changes you destroy this feature. Now at_pointer constructs the Python objects forcefully.
I propose that at_pointer return the Document object, and that the Document object implement the mini property or method. (I did not find how mini is even accessible in the current code.) With these changes, the sample code above can be rewritten to:
doc = Parser().parse('{"a": {"b": ...}}')
b = doc.at_pointer('/a/b')
s = b.mini
But this still requires one to know the full path.
I am also using pysimdjson as follows:
doc = Parser().parse('{"a": [{"x": ...}, ...]}')
items = list(doc['a'])
for item in items:
    item[y] = ...
s = deep_jsonify(items) # uses .mini when possible
First of all, the drop-in functionality of read-only list and dict structures is very nice here. Second, the new Document does not offer any way to list items at all without creating Python objects for the full JSON subtree. If you hate the Array and Object classes, maybe add a Document.parse_shallow that returns a Python element which, in the case of a list, contains Document objects, and likewise for a dict?
P.S. Document.root and Document.as_object are the same function with two names, and neither seems to be implemented for backward compatibility reasons.
Thanks for the feedback @edgarsi, appreciate it.
> With the new changes you destroy this feature. Now at_pointer constructs the Python objects forcefully.
This PR won't be merged until it's back to feature parity with v5. The Array and Object interfaces have to disappear for memory safety. While there are a bunch of ways to make it "safe", they come at a severe performance penalty for small documents. They also tended to be used to access more than a key or two, which is often slower than just getting the entire object.
> I propose at_pointer returns the Document object, and the Document object implements the mini property or method. (I did not find how mini is even accessible in the current code.) With these changes, the sample code above can be rewritten to:
Most of the methods on Document() will mimic their counterparts in py_yyjson, where every method can take a pointer. .mini will become mini(at_pointer: str = /a/b). You'll actually see a bit of a speed boost and slightly better memory usage.
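To make the "every method can take a pointer" shape concrete, here is a hypothetical sketch. `SketchDocument` is not pysimdjson's Document and is not the py_yyjson API; it is a standard-library stand-in that only mirrors the proposed call shape, and unlike the real design it does build Python objects internally.

```python
import json


class SketchDocument:
    """Hypothetical stand-in for illustration; NOT pysimdjson's Document."""

    def __init__(self, data):
        self._root = json.loads(data)

    def _resolve(self, at_pointer):
        node = self._root
        # Simplified lookup: splits on "/" and ignores "~0"/"~1" escapes.
        for token in at_pointer.split("/")[1:]:
            node = node[int(token)] if isinstance(node, list) else node[token]
        return node

    def mini(self, at_pointer=""):
        """Return the subtree at `at_pointer` as minified JSON text."""
        return json.dumps(self._resolve(at_pointer), separators=(",", ":"))


doc = SketchDocument('{"a": {"b": [1, 2.0, "x"]}}')
whole = doc.mini()       # minified text of the entire document
part = doc.mini("/a/b")  # minified text of one subtree only
```

With this shape the caller gets minified text for a subtree in one call, without first obtaining an intermediate element object.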
> list lists Document objects, etc for dict?
1 JSON Document will return 1 Document() object. It's a memory container, not meant to represent a simd::element. The list, dict, and numpy helpers will be back before this is merged. Proxy objects cannot be used safely in Python, because the Document() may have been reused between calls. All methods in v6 will return Python objects.
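The reuse hazard can be shown with a toy model. `ReusableParser` and `LazyProxy` below are invented for this sketch and do not reflect how simdjson or pysimdjson store documents; they only mimic a parser that recycles one buffer and a proxy that reads from it lazily.

```python
class LazyProxy:
    """Toy proxy: remembers where its data lives instead of copying it."""

    def __init__(self, parser, start, end):
        self._parser = parser
        self._start = start
        self._end = end

    def text(self):
        # Reads whatever is in the parser's buffer right now.
        return bytes(self._parser.buffer[self._start:self._end]).decode()


class ReusableParser:
    """Toy parser that recycles a single buffer for every document."""

    def __init__(self):
        self.buffer = bytearray()

    def parse(self, data):
        self.buffer[:] = data  # overwrite the previous document in place
        return LazyProxy(self, 0, len(data))


parser = ReusableParser()
first = parser.parse(b'{"a":1}')
before = first.text()      # read while the first document is still loaded
materialized = before      # a real Python str, independent of the buffer

parser.parse(b'[9,9,9]')   # the parser is reused for a second document
after = first.text()       # the old proxy now sees the new document's bytes
```

The string taken before the reuse stays valid, while the proxy silently returns unrelated data. In native code the equivalent mistake can read freed or overwritten memory rather than just stale text.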
> P.S. Document.root and Document.as_object are the same function, with two names, and neither seem to be implemented for backward compatibility reasons.
This was already fixed locally. root() isn't exposed to Python; it's a cdef that returns the document root for internal functions.