fix: add parse_string_if_needed function
This is a competing PR vs https://github.com/simdjson/simdjson/pull/2211 by @CarlosEduR
The idea is that we avoid copying strings to a string buffer when we do not need to.
@CarlosEduR uses a sensible approach that does not require 'deep' changes. This PR is somewhat deeper. Whether this PR is better than https://github.com/simdjson/simdjson/pull/2211 is an open question.
Note that @CarlosEduR checks for the null termination which is unnecessary work. Removing this check could change the story.
Partial fix to https://github.com/simdjson/simdjson/issues/1470. Note that @jkeiser's idea is somewhat more involved than what we are doing currently.
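To make the idea concrete, here is a minimal standalone sketch of "copy only if needed". It is not the code in this PR and does not use simdjson internals: the function name `read_string_copy_if_needed`, its signature, and the use of `std::string` as the string buffer are all assumptions made for illustration. It handles single-character escapes only (no `\uXXXX`) and expects `input` to start just after the opening quote.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper for illustration only; not simdjson's parse_string_if_needed.
// `input` starts just after the opening quote of a JSON string.
// Returns a view of the string content, or std::nullopt if the string is unterminated.
std::optional<std::string_view>
read_string_copy_if_needed(std::string_view input, std::string &buffer) {
  // Fast path: find the closing quote or the first backslash, whichever comes first.
  const std::size_t stop = input.find_first_of("\"\\");
  if (stop == std::string_view::npos) { return std::nullopt; }
  if (input[stop] == '"') {
    // No escaped content: point directly into the input, the buffer is untouched.
    return input.substr(0, stop);
  }
  // Slow path: an escape was found, so unescape into the string buffer.
  buffer.assign(input.data(), stop); // copy the prefix that was already scanned
  for (std::size_t i = stop; i < input.size(); ++i) {
    char c = input[i];
    if (c == '"') { return std::string_view(buffer); }
    if (c == '\\') {
      if (++i == input.size()) { break; } // dangling backslash
      switch (input[i]) {
        case 'n': c = '\n'; break;
        case 't': c = '\t'; break;
        case 'r': c = '\r'; break;
        case 'b': c = '\b'; break;
        case 'f': c = '\f'; break;
        default:  c = input[i]; break; // covers \" \\ and \/ (\uXXXX is not handled here)
      }
    }
    buffer.push_back(c);
  }
  return std::nullopt; // no closing quote
}
```

In this sketch, the unlucky case pays for the initial scan and then for the copy, which is one way a "try to avoid the copy, then fall back" design can end up executing more instructions than always copying.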
Benchmarks on my ARM processor (Apple M2).
partial tweets
In this benchmark, strings typically do not contain escaped content, but it does happen from time to time. I don't have an exact percentage, but escaped content shows up maybe 10% to 20% of the time.
Using ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>". Run with sudo to get performance counters.
Main: best_instructions_per_byte=8.68026 best_instructions_per_cycle=6.57819
@CarlosEduR's PR: best_instructions_per_byte=9.31405 best_instructions_per_cycle=6.54882
This PR: best_instructions_per_byte=8.66243 best_instructions_per_cycle=6.5704
find tweet
This is a lucky benchmark where we never have escaped content to worry about.
Using ./build/benchmark/bench_ondemand --benchmark_filter="find_tweet<simdjson_ondemand>". Run with sudo to get performance counters.
Main: best_instructions_per_byte=4.71785 best_instructions_per_cycle=6.34339
@CarlosEduR's PR: best_instructions_per_byte=4.7256 best_instructions_per_cycle=6.34974
This PR: best_instructions_per_byte=4.71761 best_instructions_per_cycle=6.34714
Conclusion
It is too early to tell which direction this goes because (1) I only tested on one system and (2) only on two benchmarks.
ARM systems do not have to contend with runtime dispatches, so this is an advantage for this PR, compared to @CarlosEduR's PR. However, @CarlosEduR's PR could do better when runtime dispatching is needed.
@CarlosEduR's PR shows a regression which is possibly caused by the fact that it tries to avoid the copy, fails and then has to fall back on the current code. Even if it only happens one time out of 5 or 10, these unlucky cases could cost you.
@CarlosEduR's PR could be further optimized and the story might change.
Overall, my preliminary results suggest that on Apple Silicon, it is not worth avoiding a write to the string buffer.
This is an excellent PR, @lemire! I appreciate the dedication, really nice ~jab~ job!
I am using an AMD64 machine, I'll run the benchmarks locally.
I appreciate the dedication
It is not super difficult, thankfully... but I am somewhat disappointed so far that we don't see much of an effect.
I am using an AMD64 machine, I'll run the benchmarks locally.
That would be great. Please consider optimizing your own code (removing the != '\0') as this could change the story.
partial_tweets<simdjson_ondemand>
Daniel's PR:
best_instructions_per_byte=3.31552 best_instructions_per_cycle=3.22185
Master branch:
best_instructions_per_byte=3.30848 best_instructions_per_cycle=3.27537
Carlos' PR:
best_instructions_per_byte=3.6775 best_instructions_per_cycle=3.27884
find_tweet<simdjson_ondemand>
Master branch:
best_instructions_per_byte=2.28167 best_instructions_per_cycle=3.24331
Daniel's PR:
best_instructions_per_byte=2.28163 best_instructions_per_cycle=3.23015
Carlos' PR:
best_instructions_per_byte=2.28611 best_instructions_per_cycle=3.1972
I've not updated my code yet (removing the != '\0'), will do it and will share results.
@CarlosEduR
I've not updated my code yet (removing the != '\0'), will do it and will share results.
Did you get around to it?
Did you get around to it?
Yes! partial_tweets and find_tweet benchmarks, respectively:
best_instructions_per_byte=3.58215 best_instructions_per_cycle=3.15066
best_instructions_per_byte=2.28551 best_instructions_per_cycle=3.25264
@CarlosEduR
Hmmm, did the number of instructions go up in your find_tweet benchmark?
Just so we are clear, the idea was to reduce the amount of unnecessary work in your PR (check for terminating null). This should improve the performance and reduce the instruction count?