cc-pyspark
Use simdjson to read WAT payloads
Simdjson (pysimdjson) should be faster than ujson when parsing WAT payloads. It could be worth using as a drop-in replacement if installed (cf. #34 regarding ujson replacing the built-in json module).
Do we want to exclusively support pysimdjson, or should we consider implementing adapter classes to support multiple parsers? This would allow users to switch parsers at runtime, similar to what is proposed for HTML parsers in PR #47 .
As Sebastian alludes to by mentioning #34, we like having a fallback. I see that pysimdjson claims to have an internal fallback, but you can still run into weirdness, like a package lacking a wheel causing problems in CI.
@wumpus Thanks for the comment!
I wanted to add a key point: pysimdjson provides a highly performant API, as detailed here: pysimdjson Performance
These APIs, such as `parse`, offer significant performance benefits by creating objects only on access. However, these optimizations are unavailable in drop-in replacement methods like `simdjson.loads()`.
Summary of Options

1. **Compatibility-Focused Approach**
   - If compatibility is the primary goal with a slight performance boost, using the drop-in API (`simdjson.loads()`) makes sense.

2. **Aliasing `parse` as `loads`**
   - We could alias the `parse` function as `loads` to maintain compatibility. However, this feels hacky, for example:

     ```python
     def get_json_parser():
         try:
             import simdjson
             parser = simdjson.Parser()
             return parser.parse
         except ImportError:
             import ujson
             return ujson.loads

     # Sample usage
     loads = get_json_parser()
     # Now you can use `loads` to parse JSON data
     data = loads('{"key": "value"}')
     ```

3. **Performance-Focused Approach**
   - If performance is the goal, it would be better to expose the choice of parser through a configuration option. This allows users to explicitly choose pysimdjson for its advanced APIs while acknowledging its limitations (e.g., Issue #72, where incorrect use of the performant APIs can lead to penalties).
This way, we balance compatibility and performance while letting users decide what works best for their needs. I prefer Option 3, as it is a much cleaner approach if performance is the goal.
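A minimal sketch of what Option 3 could look like: parser selection through a configuration value. The factory function and the idea of a `--json-parser` option are hypothetical, purely for illustration.

```python
import importlib

def make_json_loads(name="json"):
    """Return a loads-style callable for the configured parser.

    `name` would come from a (hypothetical) --json-parser option.
    """
    if name == "simdjson":
        import simdjson
        parser = simdjson.Parser()  # re-used across records for speed
        return parser.parse         # lazy API, not fully dict-compatible!
    # "json", "ujson" and "orjson" all expose a loads() function
    return importlib.import_module(name).loads

loads = make_json_loads("json")
assert loads('{"key": "value"}') == {"key": "value"}
```

Callers choosing `simdjson` would have to opt in to the stricter lazy API; everything else stays a plain drop-in.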
Thoughts?
Hi @silentninja,
> ... compatibility is the primary goal
Yes. It looks like simdjson does not support every combination of the (OS, platform) matrix, see https://pysimdjson.tkte.ch/: Mac OS on ARM is not supported. This already happened in the past with ujson (#34). A working fall-back is always required.
> Aliasing `parse` as `loads`
No. It adds extra complexity for every WAT record and does not achieve maximum performance, see the notes about re-using the simdjson parser.
> ... performance is the goal
This can still be achieved via inheritance. For example, several example classes have a variant using FastWARC instead of warcio, see #37/#38. If you want it simple, or need to stay compatible, you can always use the classes based on warcio. Of course, with regard to simdjson, it only makes sense to implement a performant solution for classes which consume JSON resp. WAT files and are used as more than a simple example. So, I could imagine implementing it in ExtractHostLinksFastWarcJob, because this class is used by Common Crawl every month to extract the host-level links that span up the web graph.
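The inheritance pattern described above might look roughly like this; the class and method names are illustrative stand-ins, not the actual cc-pyspark classes:

```python
import json

class ExtractLinksJobSketch:
    """Base job (stand-in): uses the compatible built-in json module."""

    def parse_record(self, payload: bytes):
        return json.loads(payload)

class ExtractLinksSimdJsonJobSketch(ExtractLinksJobSketch):
    """Performant variant: overrides only the JSON-parsing hook."""

    def __init__(self):
        import simdjson  # imported lazily, so the base class needs no extra deps
        self._parser = simdjson.Parser()  # re-used for every WAT record

    def parse_record(self, payload: bytes):
        return self._parser.parse(payload)  # lazy, non-recursive parsing

job = ExtractLinksJobSketch()
assert job.parse_record(b'{"links": []}') == {"links": []}
```

Users who want simplicity or compatibility keep the base class; the subclass is opt-in.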
> Yes. Looks like simdjson does not support every combination of the matrix (OS, platform), see https://pysimdjson.tkte.ch/: Mac OS on ARM is not supported. This already happened in the past with ujson (https://github.com/commoncrawl/cc-pyspark/issues/34). A working fall-back is always required.
Just noting that although it doesn't appear in the grid, it is supported and universal ARM/x86 wheels are published. Oversight on my part, grid will be updated with the next release.
If portability is your primary concern, https://github.com/tktech/py_yyjson performs much the same as pysimdjson while being standard C89, and has binary wheels available for all platforms with wheel tags.
See https://github.com/tktech/json_benchmark for comparisons of the most popular parsers.
After trying out simdjson, there are a few pitfalls that make me hesitant to use simdjson as a direct alternative to the standard json module.
1. The API of the simdjson object is different from the object created by the json module. This can lead to errors like https://github.com/TkTech/pysimdjson/issues/122.

2. Unlike the json module, the simdjson object requires explicit deallocation. If not properly managed, this can lead to errors, as it is more sensitive to memory management.

3. The parser returned by simdjson is not serializable, which can cause exceptions if not handled correctly.

While none of the mentioned issues break the existing functions, if we intend to go with simdjson, the tasks should be written primarily for the stricter simdjson API, using json as a fallback. Is that acceptable for this project?
@sebastian-nagel I went forward with using pysimdjson in #49. I didn't use https://github.com/tktech/py_yyjson, as its API is quite different from the standard json Python module and might not be familiar to everyone.
Benchmarked the server counts and extract links jobs...
- using FastWARC (#37) gives a huge performance gain: 50% on the server count and 16% for link extraction
- if pysimdjson is used as a drop-in replacement (`simdjson.loads` / recursive parsing), there is still a gain compared to ujson; however, orjson performs better if used as a drop-in replacement
- there can be a significant gain if pysimdjson is integrated not as a drop-in but with re-use of the parser and lazy (non-recursive) parsing. This also requires additional changes because the API is not compatible, see #49 (observations by @silentninja)
  - clearly visible for the server counts: 40% faster than the `simdjson.loads` variant and still 30% faster than using orjson
  - not visible for the link extraction: here we need to read the entire JSON blob, so lazy parsing has no effect
Recommendations from the benchmarking:
- consider switching from ujson to orjson
- implement a more performant variant of the ServerCountFastWarcJob using simdjson, but not as a drop-in replacement
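The first recommendation could be implemented as a guarded import, keeping the built-in json module as the fallback discussed earlier in this thread (a sketch, not the project's actual code):

```python
try:
    import orjson

    def json_loads(data):
        # orjson.loads accepts bytes or str and returns plain Python objects
        return orjson.loads(data)
except ImportError:
    import json

    def json_loads(data):
        return json.loads(data)

assert json_loads('{"a": [1, 2]}') == {"a": [1, 2]}
```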
Comments welcome!
Below are the detailed benchmark results:
- running Spark 3.5.5 (local mode)
- on CC-MAIN-20250430184734-20250430214734-00059.warc.wat.gz
- total time measured with option `--spark-profiler` for all routines (not only JSON parsing)
- best time of 2-3 measurements
- jobs
- (A) ServerCountJob
- (B) ServerCountFastWarcJob
- (C) ExtractHostLinksJob
- (D) ExtractHostLinksFastWarcJob
| AMD Ryzen 7 8845HS | (A) | (B) | (C) | (D) |
|---|---|---|---|---|
| json | 12.256 | 6.455 | 37.438 | 30.588 |
| ujson | 10.865 | 4.980 | 35.686 | 29.943 |
| orjson | 10.054 | 4.267 | 34.683 | 28.510 |
| simdjson.loads | 10.960 | 5.037 | 35.632 | 29.626 |
| simdjson (lazy) | 8.667 | 2.974 | 36.475 | 29.480 |

| ARM (AWS r8g.large) | (A) | (B) | (C) | (D) |
|---|---|---|---|---|
| json | 16.719 | 9.073 | 49.683 | 41.628 |
| ujson | 15.423 | 7.639 | 48.488 | 40.452 |
| orjson | 13.895 | 6.316 | 46.229 | 38.488 |
| simdjson.loads | 15.227 | 7.524 | 47.977 | 39.813 |
| simdjson (lazy) | 12.379 | 4.872 | 48.706 | 39.813 |
> See https://github.com/tktech/json_benchmark for comparisons of the most popular parsers.

@TkTech, this seems largely congruent with your benchmark.