Slow file reading

Open Shatur opened this issue 3 years ago • 11 comments

Description:

I found a use case where jaq is slower:

[image] [image]

Commands:

jaq .features[10000].properties.LOT_NUM < citylots.json 
jq -cM .features[10000].properties.LOT_NUM < citylots.json 

JSON file

Shatur avatar Apr 27 '22 11:04 Shatur

Thanks for your example. I can see that there are lots of decimal numbers in the input, and preserving these perfectly as strings may slow down processing. However, I will probably only get around to looking at your example in more detail in a few weeks.

01mf02 avatar Apr 29 '22 15:04 01mf02

Ok, I have looked a bit at your test file. I found the following: Reading the file with $(JQ) 'empty' < citylots.json takes about 9.3s for jq and 10.9s for jaq. Calculating with $(JQ) .features[10000].properties.LOT_NUM < citylots.json takes 9.6s for jq and 11.0s for jaq. So calculating the output value is actually quite cheap compared to loading the file.

Just loading the JSON file with serde_json (the JSON parser used by jaq) takes only 9.5s. So the conversion from the serde_json format to jaq's format is responsible for the overhead. Unfortunately, this conversion is currently necessary, because serde_json can only deserialize to values in its own format.

I believe that this overhead could be eliminated if we could make serde_json more flexible such that it can deserialize to any type that implements certain traits, such as From<bool>. This way, we could avoid the conversion step between serde_json and jaq, because serde_json could just directly create values in jaq's format. At the moment, however, I do not know how feasible it is to make this change to serde_json.
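
To make the idea concrete, here is a minimal sketch using only the standard library. It does not use serde_json or jaq code: `ParserValue` and `Val` are hypothetical stand-ins for the parser's value type and jaq's value type, and `parse` is a toy parser over pre-split tokens that only understands `true`, `false` and arrays. The point is only that a parser generic over `V: From<bool> + From<Vec<V>>` can build the target type directly, whereas the two-step path has to walk and reallocate the whole tree in `convert`.

```rust
// Hypothetical stand-in for the parser's own value type.
#[derive(Debug, PartialEq)]
enum ParserValue {
    Bool(bool),
    Arr(Vec<ParserValue>),
}

// Hypothetical stand-in for the consumer's value type.
#[derive(Debug, PartialEq)]
enum Val {
    Bool(bool),
    Arr(Vec<Val>),
}

impl From<bool> for ParserValue {
    fn from(b: bool) -> Self { ParserValue::Bool(b) }
}
impl From<Vec<ParserValue>> for ParserValue {
    fn from(a: Vec<ParserValue>) -> Self { ParserValue::Arr(a) }
}
impl From<bool> for Val {
    fn from(b: bool) -> Self { Val::Bool(b) }
}
impl From<Vec<Val>> for Val {
    fn from(a: Vec<Val>) -> Self { Val::Arr(a) }
}

// Two-step path: walks the finished tree a second time and
// allocates a new vector for every array in it.
fn convert(v: ParserValue) -> Val {
    match v {
        ParserValue::Bool(b) => Val::Bool(b),
        ParserValue::Arr(a) => Val::Arr(a.into_iter().map(convert).collect()),
    }
}

// Toy parser that is generic over the output type: it can build any V
// that can be constructed from a bool and from a vector of V.
fn parse<V>(tokens: &[&str], pos: &mut usize) -> Option<V>
where
    V: From<bool> + From<Vec<V>>,
{
    let tok = *tokens.get(*pos)?;
    *pos += 1;
    match tok {
        "true" => Some(V::from(true)),
        "false" => Some(V::from(false)),
        "[" => {
            let mut items: Vec<V> = Vec::new();
            // Returns None if the input ends before the closing bracket.
            while *tokens.get(*pos)? != "]" {
                items.push(parse::<V>(tokens, pos)?);
            }
            *pos += 1; // consume "]"
            Some(V::from(items))
        }
        _ => None,
    }
}

fn main() {
    let tokens = ["[", "true", "[", "false", "]", "]"];
    // Two-step: build the parser's tree, then convert it.
    let two_step = convert(parse::<ParserValue>(&tokens, &mut 0).unwrap());
    // Direct: the same parser builds the target type, with no second pass.
    let direct: Val = parse(&tokens, &mut 0).unwrap();
    assert_eq!(two_step, direct);
    println!("{:?}", direct);
}
```

Both paths yield the same value; the direct one simply skips the `convert` pass, which is the step measured as overhead above.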

01mf02 avatar Jun 07 '22 08:06 01mf02

I made another quick experiment: I mapped all numbers contained in serde_json values to the constant 0 in the output jaq value. This way, $(JQ) 'empty' < citylots.json takes 10.4s. That means that even with very fast conversion of numbers, just converting the whole JSON value once still takes nearly one second. In conclusion, making serde_json more flexible does indeed seem necessary to be competitive with jq.
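
The arithmetic behind "nearly one second" can be written out as follows. This is only a rough sketch that subtracts the single timings quoted above (9.5s for serde_json alone, 10.9s for jaq, 10.4s with numbers mapped to 0); the function and field names are made up for illustration, and the split assumes the three runs are directly comparable.

```rust
// Rough split of the conversion overhead, in seconds.
struct Breakdown {
    total: f64,     // full jaq run minus bare serde_json parsing
    structure: f64, // overhead that remains when numbers are mapped to 0
    numbers: f64,   // part attributable to number conversion
}

fn breakdown(parse_only: f64, full: f64, numbers_zeroed: f64) -> Breakdown {
    Breakdown {
        total: full - parse_only,
        structure: numbers_zeroed - parse_only,
        numbers: full - numbers_zeroed,
    }
}

fn main() {
    let b = breakdown(9.5, 10.9, 10.4);
    println!(
        "total {:.1}s = structure {:.1}s + numbers {:.1}s",
        b.total, b.structure, b.numbers
    );
}
```

Under these assumptions, about 0.9s of the roughly 1.4s overhead comes from rebuilding the value structure and about 0.5s from converting numbers.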

01mf02 avatar Jun 07 '22 08:06 01mf02

Oh, I see, interesting investigation. Maybe there are some alternatives for deserialization?

Shatur avatar Jun 07 '22 08:06 Shatur

Yes. One crazy idea: Use simd_json for high-performance parsing. I also stumbled over ijson. I just tried to integrate simd_json into jaq, but failed, because I use serde_json::Deserializer::into_iter, but simd_json::Deserializer does not seem to have such functionality. :(

Furthermore, reading https://github.com/serde-rs/json-benchmark suggests that maybe there are not such enormous gains to be had by using another JSON parser ...

@Shatur, if you find a way to parse your JSON data faster than serde_json using any Rust library such as simd_json / ijson / ..., then I would be willing to try to integrate that method into jaq. In the end, I believe that only the read_json function in main.rs would need to be changed (at least if you manage to get simd_json running).

01mf02 avatar Jun 07 '22 09:06 01mf02

I just tried to integrate simd_json into jaq, but failed, because I use serde_json::Deserializer::into_iter, but simd_json::Deserializer does not seem to have such functionality. :(

Maybe it is related to the use of SIMD?

suggests that maybe there are not such enormous gains to be had by using another JSON parser ...

This is true, but if I understand correctly, the main problem is the flexibility of serde_json.

Shatur avatar Jun 07 '22 10:06 Shatur

This is true, but if I understand correctly, the main problem is the flexibility of serde_json.

Yes, but if the JSON parser is much faster, then even with the conversion overhead imposed by the non-flexibility of serde_json, jaq could be faster than before. But again, this depends on how much faster parsing can really become.

01mf02 avatar Jun 07 '22 12:06 01mf02

Sure, I mean that we could try a different library, which could even be a little slower, but more flexible.

Shatur avatar Jun 07 '22 14:06 Shatur

I found that after all, serde_json seems to provide the flexibility that we need ... mostly. I just have not figured out yet how to deal with arbitrary precision numbers.

01mf02 avatar Jun 07 '22 15:06 01mf02

I found that after all, serde_json seems to provide the flexibility that we need ... mostly.

Great!

Shatur avatar Jun 07 '22 18:06 Shatur

Don't forget the "mostly" part. ;)

01mf02 avatar Jun 08 '22 07:06 01mf02

See #50.

01mf02 avatar Nov 08 '22 16:11 01mf02

The new parser based on hifijson improves performance by 87% compared to jaq-0.9.0 and by 68% compared to jq for this example:

hyperfine -L jq ./jq-ndebug,./jaq-0.9.0,./jaq-0.10.0 '{jq} empty ~/Downloads/citylots.json'
Benchmark #1: ./jq-ndebug empty ~/Downloads/citylots.json
  Time (mean ± σ):      6.266 s ±  0.147 s    [User: 5.240 s, System: 1.024 s]
  Range (min … max):    5.951 s …  6.475 s    10 runs
 
Benchmark #2: ./jaq-0.9.0 empty ~/Downloads/citylots.json
  Time (mean ± σ):      6.988 s ±  0.178 s    [User: 6.003 s, System: 0.983 s]
  Range (min … max):    6.704 s …  7.281 s    10 runs
 
Benchmark #3: ./jaq-0.10.0 empty ~/Downloads/citylots.json
  Time (mean ± σ):      3.738 s ±  0.076 s    [User: 2.994 s, System: 0.742 s]
  Range (min … max):    3.641 s …  3.881 s    10 runs
 
Summary
  './jaq-0.10.0 empty ~/Downloads/citylots.json' ran
    1.68 ± 0.05 times faster than './jq-ndebug empty ~/Downloads/citylots.json'
    1.87 ± 0.06 times faster than './jaq-0.9.0 empty ~/Downloads/citylots.json'

Here, jq-ndebug is jq-cff5336 without debugging info.
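
As a cross-check on the summary figures, the following sketch recomputes them from the mean times reported by hyperfine above. The helper names are made up for illustration. It also shows the difference between a speed ratio ("1.87 times faster") and the corresponding reduction in wall-clock time, which is a smaller percentage.

```rust
// How many times faster the new run is than the baseline.
fn speedup(baseline_s: f64, new_s: f64) -> f64 {
    baseline_s / new_s
}

// By what percentage the wall-clock time shrinks relative to the baseline.
fn time_reduction_percent(baseline_s: f64, new_s: f64) -> f64 {
    (1.0 - new_s / baseline_s) * 100.0
}

fn main() {
    // Mean times in seconds, copied from the benchmark output.
    let (jq, jaq_old, jaq_new) = (6.266, 6.988, 3.738);
    for (name, base) in [("jq-ndebug", jq), ("jaq-0.9.0", jaq_old)] {
        println!(
            "vs {}: {:.2}x faster, {:.0}% less time",
            name,
            speedup(base, jaq_new),
            time_reduction_percent(base, jaq_new)
        );
    }
}
```

So the 68% and 87% figures describe the speed ratio; in terms of elapsed time, jaq-0.10.0 needs roughly 40% less time than jq and roughly 47% less than jaq-0.9.0 on this file.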

I therefore think that the issue of slow file reading is resolved. :)

01mf02 avatar Mar 06 '23 11:03 01mf02