simdjson fix: add parse_string_if

This is a competing PR vs https://github.com/simdjson/simdjson/pull/2211 by @CarlosEduR

The idea is that we avoid copying strings to a string buffer when we do not need to.

@CarlosEduR uses a sensible approach that does not require 'deep' changes. This PR is somewhat deeper. Whether this PR is better than https://github.com/simdjson/simdjson/pull/2211 is an open question.

Note that @CarlosEduR checks for the null termination which is unnecessary work. Removing this check could change the story.

partial fix to https://github.com/simdjson/simdjson/issues/1470 Note that @jkeiser's idea is somewhat more involved than what we are doing currently.

Benchmarks on my ARM processor (Apple M2).

partial tweek

In this benchmark, we typically do not have escaped content in strings, but it does happen from time to time. I don't have an exact percentage but it is maybe 20% or 10% of the time that we have escaped content.

Using ./build/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>". Run with sudo to get performance counters.

Main: best_instructions_per_byte=8.68026 best_instructions_per_cycle=6.57819

@CarlosEduR's PR: best_instructions_per_byte=9.31405 best_instructions_per_cycle=6.54882

This PR: best_instructions_per_byte=8.66243 best_instructions_per_cycle=6.5704

find tweet

This is a lucky benchmark where we never have escaped content to worry about.

Using ./build/benchmark/bench_ondemand --benchmark_filter="find_tweet<simdjson_ondemand>". Run with sudo to get performance counters.

Main best_instructions_per_byte=4.71785 best_instructions_per_cycle=6.34339

@CarlosEduR's PR: best_instructions_per_byte=4.7256 best_instructions_per_cycle=6.34974

This PR: best_instructions_per_byte=4.71761 best_instructions_per_cycle=6.34714

Conclusion

It is too early to tell which direction this goes because (1) I only tested on one system and (2) only on two benchmarks.

ARM systems do not have to contend with runtime dispatches, so this is an advantage for this PR, compared to @CarlosEduR's PR. However, @CarlosEduR's PR could do better when runtime dispatching is needed.

@CarlosEduR's PR shows a regression which is possibly caused by the fact that it tries to avoid the copy, fails and then has to fall back on the current code. Even if it only happens one time out of 5 or 10, these unlucky cases could cost you.

@CarlosEduR's PR could be further optimized and the story might change.

Overall, my preliminary results suggest that on Apple Silicon, it is not worth avoiding a write on the string buffer.

Jul 11 '24 23:07 lemire

This is an excellent PR, @lemire! I appreciate the dedication, really nice ~jab~ job!

I am using a AMD64, I'll run the benckmarks locally.

Jul 12 '24 00:07 CarlosEduR

I appreciate the dedication

It is not super difficult, thankfully... but I am somewhat disappointed so far that we don't see much of an effect.

I am using a AMD64, I'll run the benckmarks locally.

That would be great. Please consider optimizing your own code (removing the != '\0') as this could change the story.

Jul 12 '24 00:07 lemire

partial_tweets<simdjson_ondemand>

Daniel's PR: best_instructions_per_byte=3.31552 best_instructions_per_cycle=3.22185

Master Branch: best_instructions_per_byte=3.30848 best_instructions_per_cycle=3.27537

Carlos' PR best_instructions_per_byte=3.6775 best_instructions_per_cycle=3.27884

find_tweet<simdjson_ondemand>

Master branch: best_instructions_per_byte=2.28167 best_instructions_per_cycle=3.24331

Daniel's PR: best_instructions_per_byte=2.28163 best_instructions_per_cycle=3.23015

Carlos' PR: best_instructions_per_byte=2.28611 best_instructions_per_cycle=3.1972

I've not updated my code yet (removing the != '\0'), will do it and will share results.

Jul 16 '24 02:07 CarlosEduR

@CarlosEduR

I've not updated my code yet (removing the != '\0'), will do it and will share results.

Did you get around to it?

Sep 18 '24 13:09 lemire

Did you get around to it?

yes! partial_tweets and find_tweet benchmarks:

best_instructions_per_byte=3.58215 best_instructions_per_cycle=3.15066
best_instructions_per_byte=2.28551 best_instructions_per_cycle=3.25264

Sep 25 '24 00:09 CarlosEduR

@CarlosEduR

Hmmm, did the number of instructions go up in your find_tweet benchmark?

Just so we are clear, the idea was to reduce the amount of unnecessary work in your PR (check for terminating null). This should improve the performance and reduce the instruction count?

Sep 25 '24 12:09 lemire

fix: add parse_string_if_needed function

partial tweek

find tweet

Conclusion

partial_tweets<simdjson_ondemand>

find_tweet<simdjson_ondemand>