pysimdjson
Prove we're worth using in real-world cases
We should experiment with adding simdjson to some real-world projects where performance matters. This is both to prove we're worth using and to ensure our API is extensive enough for real-world problems, extending it where needed. Basically "success stories". If you've used pysimdjson successfully, please feel free to contribute.
- @ericls's https://github.com/fellowinsights/prosemirror-py: an ~8% gain on tiny documents (the document found in the project's example.py) from just switching `import json` to `import simdjson`. before:
```
------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in us)        Min      Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------
test_decoding_steps  26.0000  78.1000  28.3224  4.1256  27.3000  0.7000   559;930       35.3077    7571           1
----------------------------------------------------------------------------------------------------------------------
```

after:

```
------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in us)        Min      Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------
test_decoding_steps  24.2000  74.5000  25.6597  2.8573  25.0000  0.5000   365;674       38.9716    5702           1
----------------------------------------------------------------------------------------------------------------------
```

Our high max time is the one-time overhead from selecting the algorithm implementation, which can be avoided.
- On more realistic documents with many attributes that the server doesn't care about, and using the simdjson-specific API for lazy dicts, the speed gain was drastic: a 4-8x increase on synthetic test documents. However, minimal documents performed much worse, as every key was accessed anyway. Both are things that could be improved by rewriting parts of prosemirror-py to be simdjson-aware rather than patching it in. Additionally, there were a few checks for `isinstance(x, list)` that needed to be updated, as our Array type isn't a true list.
- Kinto is Mozilla's simple key-value database used for some production services like bookmark syncing. Its JSON performance is abysmal, and although it (sometimes) uses ujson, it does some odd bug workarounds. As an example, using the memory backend causes a `JSON decode (built-in) -> encode (built-in) -> decode (ujson) -> encode (built-in)` round-trip when creating a record (https://github.com/Kinto/kinto/blob/5f8ba312d0af8cac8d6f2ee5371bd26d5501be7e/kinto/core/storage/memory.py#L205) in an attempt to fix an issue where keys might be byte strings. Sure, Kinto isn't trying to be speedy, but this is silly and we can improve it. WIP.
- https://github.com/elastic/rally/issues/1046
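The Kinto encode/decode chain mentioned above exists only to normalize byte-string keys. A direct recursive normalizer (a sketch, not Kinto's actual code) would achieve the same thing without any JSON round-trip:

```python
def ensure_str_keys(value):
    """Normalize bytes keys/values in place of a full
    decode -> encode -> decode -> encode cycle through JSON libraries."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, dict):
        return {ensure_str_keys(k): ensure_str_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [ensure_str_keys(v) for v in value]
    return value

record = {b"id": b"abc", "title": "bookmarks", "tags": [b"a", "b"]}
print(ensure_str_keys(record))
# {'id': 'abc', 'title': 'bookmarks', 'tags': ['a', 'b']}
```

This walks the structure once instead of serializing and reparsing it up to four times, and it would remove the dependency on a JSON library's quirks entirely.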
I'd be happy to try out simdjson in Kinto :)
Just to clarify:
- Kinto is not really a key-value database, but more like a «remote JSON storage»
- it's not used for bookmark syncing, but for something called «remote settings»
- the code that you point out is indeed quite bad. Fortunately this is a dumb in-memory backend used for dev only. I'd be happy to know if you spotted other places with ugly stuff!
You're right that JSON de/serialization is critical in Kinto. simdjson could definitely help improve performance! But from the Mozilla standpoint, we may not be able to provide you with a concrete production example where simdjson brings a lot of value, because most of the API calls are read-only (serialization only) and behind a CDN.
Thanks for the clarifications @leplatrem! I'm not sure why I thought Kinto was used for bookmark sync.