pysimdjson
Prove we're worth using in real-world cases
We should experiment with adding simdjson to some real-world projects where performance matters. This is both to prove we're worth using and to ensure our API is extensive enough for real-world problems, extending it where needed. Basically "success stories". If you've used pysimdjson successfully, please feel free to contribute.
- @ericls's https://github.com/fellowinsights/prosemirror-py: an ~8% gain on tiny documents (the document found in the project's example.py) from just switching `import json` to `import simdjson`. before:
```
------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in us)        Min      Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------
test_decoding_steps  26.0000  78.1000  28.3224  4.1256  27.3000  0.7000   559;930       35.3077    7571           1
----------------------------------------------------------------------------------------------------------------------
```

after:

```
------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in us)        Min      Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------
test_decoding_steps  24.2000  74.5000  25.6597  2.8573  25.0000  0.5000   365;674       38.9716    5702           1
----------------------------------------------------------------------------------------------------------------------
```

Our high max time is the one-time overhead from selecting the algorithm implementation, which can be avoided.
- On more realistic documents with many attributes that the server doesn't care about, and using the simdjson-specific API for lazy dicts, the speed gain was drastic: a 4-8x increase on synthetic test documents. However, minimal documents performed much worse, as every key was accessed anyway. Both are things that could be improved by rewriting parts of prosemirror-py to be simdjson-aware rather than patching it in. Additionally, there were a few checks for `isinstance(x, list)` that needed to be updated, as our Array type isn't a true list.
- Kinto is Mozilla's simple key-value database used for some production services like bookmark syncing. Its JSON performance is abysmal, and although it (sometimes) uses ujson, it does some odd bug workarounds. As an example, using the memory backend causes a `JSON decode (built-in) -> encode (built-in) -> decode (ujson) -> encode (built-in)` round-trip when creating a record (https://github.com/Kinto/kinto/blob/5f8ba312d0af8cac8d6f2ee5371bd26d5501be7e/kinto/core/storage/memory.py#L205) in an attempt to fix an issue where keys might be byte strings. Sure, Kinto isn't trying to be speedy, but this is silly and we can improve it. WIP.
- https://github.com/elastic/rally/issues/1046
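The Kinto encode/decode chain mentioned above exists only to normalize byte-string keys. A direct recursive normalizer (a sketch, not Kinto's actual code) would achieve the same thing without any JSON round-trip:

```python
def ensure_str_keys(value):
    """Normalize bytes keys/values in place of a full
    decode -> encode -> decode -> encode cycle through JSON libraries."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, dict):
        return {ensure_str_keys(k): ensure_str_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [ensure_str_keys(v) for v in value]
    return value

record = {b"id": b"abc", "title": "bookmarks", "tags": [b"a", "b"]}
print(ensure_str_keys(record))
# {'id': 'abc', 'title': 'bookmarks', 'tags': ['a', 'b']}
```

This walks the structure once instead of serializing and reparsing it up to four times, and it would remove the dependency on a JSON library's quirks entirely.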
I'd be happy to try out simdjson in Kinto :)
Just to clarify:
- Kinto is not really a key-value database, but more like a «remote JSON storage»
- it's not used for bookmark syncing, but for something called «remote settings»
- the code that you point out is indeed quite bad. Fortunately this is a dumb in-memory backend used for dev only. I'd be happy to know if you spotted other places with ugly stuff!
You're right that JSON de/serialization is critical in Kinto. simdjson could definitely help improve performance! But from the Mozilla standpoint, we may not be able to provide you with a concrete production example where simdjson brings a lot of value, because most of the API calls are read-only (serialization only) and behind a CDN.
Thanks for the clarifications @leplatrem! I'm not sure why I thought Kinto was used for bookmark sync.