pysimdjson

v7.0.0

Open TkTech opened this issue 1 year ago • 3 comments

This is a major breaking change release that removes Array and Object proxies. However, after checking all GitHub repos that have this one as a dependency with > 5 stars, only 2 were using these features. They were generally an anti-pattern - if you needed 1 value, use at_pointer() instead. If you needed more than 1 value, it was almost always faster to use at_pointer() for an entire object at once. This new approach also alleviates memory management issues on PyPy.

If all you used was simdjson.loads() and simdjson.parse(), you should notice no difference.
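For readers unfamiliar with the path syntax that at_pointer() accepts, the following is a minimal sketch of JSON Pointer (RFC 6901) resolution. It uses only the standard library json module, not pysimdjson, and resolve_pointer is a hypothetical helper written for illustration; it is not how the library implements at_pointer().

```python
import json


def resolve_pointer(value, pointer):
    """Resolve an RFC 6901 JSON Pointer such as '/a/b' against parsed JSON.

    Hypothetical helper for illustration only; not part of pysimdjson.
    """
    if pointer == "":
        return value  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    for token in pointer[1:].split("/"):
        # Unescape in this order: '~1' becomes '/', then '~0' becomes '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(value, list):
            value = value[int(token)]  # array elements are addressed by index
        else:
            value = value[token]  # object members are addressed by key
    return value


doc = json.loads('{"a": {"b": [10, 20, 30]}, "x/y": 1}')
one_value = resolve_pointer(doc, "/a/b/1")   # a single value
whole_object = resolve_pointer(doc, "/a")    # an entire object at once
escaped = resolve_pointer(doc, "/x~1y")      # key containing a '/'
```

The two patterns from the description map onto the first two calls: one pointer for a single value, or one pointer for the whole object when several values are needed.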

  • Drop Python 3.6 and 3.7, which are now beyond end-of-life. Add Python 3.11.
  • Exploits CPython Unicode object internals for significantly faster string creation (up to 45%!)
  • Removed Array and Object proxy objects.
    • Changing our approach to this has significantly improved memory safety internally and fixed PyPy support.
  • Update deprecated GitHub Actions.
  • Update vendored simdjson to version 3.2.3.
    • Minified floats no longer drop the .0 (see #102)

ToDo:

  • [ ] Re-add JSON-to-buffer/numpy array removed in initial cleanup (this method is many times faster than naively loading JSON when turning a homogeneous array of JSON values into a numpy array)
  • [ ] Add support for latest PyPy
  • [ ] Memory optimization pass
  • [ ] Update documentation and examples.

TkTech avatar Sep 03 '23 01:09 TkTech

For certain benchmarks, especially those that are string-heavy, this version is now roughly 45% faster.

---------------------------------------------------------------------- benchmark 'Complete load of data/twitter.json': 2 tests -----------------------------------------------------------------------
Name (time in us)             Min                   Max                  Mean              StdDev                Median                IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
simdjson (NOW)           916.4979 (1.0)      4,188.6930 (1.0)      1,011.2369 (1.0)      391.3896 (1.0)        939.6220 (1.0)      24.2511 (1.0)         16;88  988.8880 (1.0)         690           1
simdjson (OLD)     1,328.0310 (1.45)     4,533.0260 (1.08)     1,428.9499 (1.41)     414.6389 (1.06)     1,355.7710 (1.44)     31.2135 (1.29)        12;49  699.8146 (0.71)        507           1
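The parenthesised figures in the table are ratios against the fastest row. As a small worked check, the sketch below recomputes them from the raw numbers above; the variable names are made up for illustration.

```python
# Raw figures copied from the benchmark table (times in microseconds).
now = {"min": 916.4979, "mean": 1011.2369, "median": 939.6220, "ops": 988.8880}
old = {"min": 1328.0310, "mean": 1428.9499, "median": 1355.7710, "ops": 699.8146}

# OLD divided by NOW, rounded the same way as the table.
ratios = {key: round(old[key] / now[key], 2) for key in now}

# The same comparison expressed as the fraction of time saved on the
# fastest run, and as the throughput gain in operations per second.
time_saved = 1 - now["min"] / old["min"]
throughput_gain = now["ops"] / old["ops"] - 1

print(ratios)
print(f"time saved (min): {time_saved:.1%}, throughput gain: {throughput_gain:.1%}")
```

Read this way, the "45%" figure means the old version takes about 1.45 times as long on the fastest run, which is a different number from the fraction of time saved.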

TkTech avatar Sep 04 '23 05:09 TkTech

I am using pysimdjson to do work like this:

doc = Parser().parse('{"a": {"b": ...}}')
b = doc['a']['b']
s = b.mini

The contents under b are huge, and pysimdjson allows me to avoid creating Python objects of them.

With the new changes you destroy this feature. Now at_pointer constructs the Python objects forcefully.

I propose that at_pointer return the Document object, and that the Document object implement the mini property or method. (I could not find how mini is even accessible in the current code.) With these changes, the sample code above can be rewritten to:

doc = Parser().parse('{"a": {"b": ...}}')
b = doc.at_pointer('/a/b')
s = b.mini

But this still requires one to know the full path.

I am also using pysimdjson as follows:

doc = Parser().parse('{"a": [{"x": ...}, ...]}')
items = list(doc['a'])
for item in items:
    item[y] = ...
s = deep_jsonify(items)  # uses .mini when possible

First of all, the drop-in functionality of read-only list and dict structures is very nice here. Second, the new Document does not offer any way to list items at all without creating Python objects for the full JSON subtree. If you hate the Array and Object classes, how about a Document.parse_shallow that returns a Python element which, in the case of a list, lists Document objects, and likewise for a dict?

P.S. Document.root and Document.as_object are the same function with two names, and neither seems to be implemented for backward compatibility reasons.

edgarsi avatar Sep 05 '23 21:09 edgarsi

Thanks for the feedback @edgarsi, appreciate it.

> With the new changes you destroy this feature. Now at_pointer constructs the Python objects forcefully.

This PR won't be merged until it's back to feature parity with v5. The Array and Object interfaces have to disappear for memory safety. While there are a bunch of ways to make it "safe", they come at a severe performance penalty for small documents. They also tended to be used to access more than a key or two, which is often slower than just getting the entire object.

> I propose at_pointer returns the Document object, and the Document object implements the mini property or method. (I did not find how mini is even accessible in the current code.) With these changes, the sample code above can be rewritten to:

Most of the methods on Document() will mimic their counterparts in py_yyjson, where every method can take a pointer. .mini will become mini(at_pointer: str = "/a/b"). You'll actually see a bit of a speed boost and slightly better memory usage.
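To make the call shape concrete, here is a rough sketch of a document whose methods each take a pointer. PointerDocument and its methods are hypothetical stand-ins built on the standard library json module. They only mirror the interface described above and say nothing about the final pysimdjson API or its performance, since this version materialises Python objects before serialising.

```python
import json


class PointerDocument:
    """Hypothetical stand-in: every method accepts a JSON Pointer."""

    def __init__(self, source):
        self._root = json.loads(source)

    def _at(self, at_pointer):
        # Walk the pointer one token at a time; "" means the document root.
        node = self._root
        for token in at_pointer.split("/")[1:]:
            token = token.replace("~1", "/").replace("~0", "~")
            node = node[int(token)] if isinstance(node, list) else node[token]
        return node

    def as_object(self, at_pointer=""):
        """Return the Python object found at the pointer."""
        return self._at(at_pointer)

    def mini(self, at_pointer=""):
        """Return minified JSON text for the subtree at the pointer."""
        return json.dumps(self._at(at_pointer), separators=(",", ":"))


doc = PointerDocument('{"a": {"b": {"k": [1, 2]}}}')
s = doc.mini("/a/b")       # replaces doc['a']['b'].mini
whole = doc.mini()         # no pointer: the whole document
```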

> list lists Document objects, etc for dict?

1 JSON Document will return 1 Document() object. It's a memory container, not meant to represent a simd::element. The list, dict, and numpy helpers will be back before this is merged. Proxy objects cannot be used safely in Python, because the Document() may have been reused between calls. All methods in v6 will return Python objects.
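The hazard with proxies can be shown by analogy in pure Python. The sketch below is not pysimdjson code: ReusingParser is a hypothetical class that reuses one buffer between calls, and a memoryview plays the role of a proxy object. A view taken from the first parse silently changes after the second parse, while a copy does not.

```python
class ReusingParser:
    """Hypothetical parser that reuses one fixed buffer between calls."""

    def __init__(self, capacity=32):
        self._buffer = bytearray(capacity)

    def parse_view(self, data):
        """Store data in the shared buffer and return a view into it."""
        size = len(data)
        if size > len(self._buffer):
            raise ValueError("document larger than the reusable buffer")
        self._buffer[:size] = data  # overwrites whatever was there before
        return memoryview(self._buffer)[:size]


parser = ReusingParser()

first_view = parser.parse_view(b'{"a":1}')  # "proxy" into the shared buffer
first_copy = bytes(first_view)              # independent Python object

parser.parse_view(b'{"b":2}')               # the buffer is reused

stale = bytes(first_view)  # now reflects the second document
```

Returning independent Python objects, as first_copy does here, avoids the problem at the cost of a copy.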

> P.S. Document.root and Document.as_object are the same function, with two names, and neither seems to be implemented for backward compatibility reasons.

This was already fixed locally. root() isn't exposed to Python, it's a cdef to return the document root for internal functions.

TkTech avatar Sep 06 '23 01:09 TkTech