[WIP] Lifting the padding requirement from simdjson APIs
@jkeiser did a lot of hard work on two PRs to remove the need for padding in the On Demand front-end: https://github.com/simdjson/simdjson/pull/1511 and https://github.com/simdjson/simdjson/pull/1606
These PRs have been diverging more and more from our main branch, so we need to refresh them. Let us do so.
- [x] fully merge all stale PRs
- [x] actually turn on the feature
- [x] update documentation
- [x] assess performance change
- [ ] debug/make tests pass, run through sanitizers
- [ ] Recover some of the performance loss.
In DOM, there is an unacceptable performance regression.
In On Demand, there is definitely a performance regression, but it is in the expected range (~5%). Factors to take into account:
1. We improve usability.
2. In some cases, we save a copy and some memory usage, which might be worth more than 5%.
3. We might be able to claw back this extra 5%. At a glance, the regression can be explained by the extra instructions we execute: the worst regression is distinct_user_id (pointer), where the instruction count went from 2,166,639 to 2,431,248, a 12% uptick.
4. The ~5% range is the kind of performance difference we get just by switching compilers, so it is unlikely to displease an existing user (they will not notice the regression).
| task | main branch | this PR | loss (%) |
|---|---|---|---|
| amazon_cellphones | 2.04 GB/s | 1.92 GB/s | 6% |
| large_amazon_cellphones | 2.07 GB/s | 2.18 GB/s | -5% |
| partial_tweets | 3.04 GB/s | 3.00 GB/s | 1% |
| large_random | 0.77 GB/s | 0.74 GB/s | 4% |
| large_random (unordered) | 0.74 GB/s | 0.71 GB/s | 4% |
| kostya | 2.29 GB/s | 2.18 GB/s | 5% |
| distinct_user_id | 3.50 GB/s | 3.35 GB/s | 4% |
| distinct_user_id (pointer) | 3.14 GB/s | 2.85 GB/s | 9% |
| find_tweet | 4.81 GB/s | 4.81 GB/s | 0% |
| top_tweet | 3.40 GB/s | 3.41 GB/s | 0% |
| top_tweet (forward) | 3.11 GB/s | 3.01 GB/s | 3% |
This PR is built on top of https://github.com/simdjson/simdjson/pull/1663
Fixes https://github.com/simdjson/simdjson/issues/174
The merge should be completed in a few hours.
I have begun removing some of the padding from the tests.
There is a whole lot to do, but let us go step-by-step.
Tests updated to work without padding.
@jkeiser I made good progress. I think it is going to work out.
I believe you had done most of the hard work.
Now we just have to check that everything is solid.
@jkeiser @NicolasJiaxin I am planning to merge this as soon as the tests go green. This has been given a lot of consideration and I think we understand it well.
Ah. We have a DOM performance regression. Let me investigate.
Ah. We see a slight performance degradation because we have lifted the padding requirement from the DOM API as well. While at it, I have documented the requirement as removed, which makes the slight performance degradation justifiable.
Getting there, slowly...
The fuzzers have identified genuine problems. Another commit is upcoming.
My expectation is now that the performance tests will fail (since we are now paying for bounds checking), but everything else should pass (including the fuzzers). If so, I will merge.
The more time I spend on this, the more depressing it gets.
It breaks everything and seems to imply greater and greater performance tradeoffs.
On Rome, Twitter/DOM goes from 2.5 GB/s to 2.3 GB/s. It is not the end of the world, but it undoes months of optimization work. Not to mention all the safety that we seem to sacrifice.
To be revisited.
It looks like this will need a lot of work still.
Marking as research. This is not ready for 1.0.
Marked for 2.0.
I have removed this from the 1.0 milestone.
My main concern is that this change introduces many vulnerabilities and would require extensive testing.
Another concern is that it seems to bring about significant performance regressions, sometimes exceeding 10%.
I could live with the performance loss, but not if it comes with vulnerabilities on top of it. The two together make for a bad mix.
So this will require extensive work.
Agreed. If I had time to work on it right now and push it forward I might feel different, but between moving to a new house, kids being home for dinner, and pushing towards a major milestone at work I'm saturated.
It feels like there is a way to do this with low performance loss, but the needle has to be threaded very carefully. The target in my head has been up to 5% performance loss, all told (which is roughly the compiler white-noise level for DOM).
My only sadness--which is absolutely not a reason to delay 1.0--is that making a performance-for-feature trade-off after 1.0 will be harder to sell, and therefore may require more engineering effort to preserve a padded version for users who can afford it. Not that that is bad, just that it's more work for smaller gains. Note that it's possible to put padding back for DOM without changing On Demand.
@jkeiser Thanks. My concern right now has mostly to do with bugs. Pushing something that is not sufficiently well tested as "1.0" would be showing bad judgement.
> My only sadness--which is absolutely not a reason to delay 1.0--is that making a performance-for-feature trade-off after 1.0 will be harder to sell and therefore may require more engineering effort to preserve a padded version for users that can afford it. Not that that is bad, just that it's more work for smaller gains.
There is a flip side to this story: as I was working on it, I thought it was a waste to undo all of the hard work to accelerate processing by going back to a slower version, even for users who do provide padded input.
I think that the proper solution might be to use templates to 'branch' on the approach (do you have padding or not?).
Also, from a research angle, there might be other ways to cook this particular chicken. The approach I thought you were going to take is different: instead of doing all of these bounds checks all over the place, examine the document when you get started and adjust the structural index so that, at a strategic location, you get a bogus error. This bogus error brings you into a distinct mode where you finish the processing with more careful code. Then you would get the no-padding behavior for free (given a large enough input).
This "bogus error" approach is also how I would try to handle the "stage 1 in chunks". You give me a 6 GB JSON document. I index it in chunks of 1 MB. I change the index so that somewhere before the end of the chunk, I encounter a bogus error. Then I know to load a new index.
I'll open an issue.
@jkeiser Note that this PR is probably very close to a working state. I am just nervous about how many bugs I will be missing. Sending this out as version 1.0 does not seem right.
@jkeiser As it is, I have been working quite a bit just to get the current code base ready for release, I am not happy with it, and it has been months. A reset like the one suggested by this PR would require massive engineering efforts to get back to the level of maturity we have.
Two years have passed since this PR was opened. Any update here, or has it been given up?
Our client receives JSON data from a server and has to create a padded_string instance before parsing it, which is a performance penalty.
I have not run tests to check the performance gap between:
- no padding, just parsing the data
- padding the data, then parsing
Could padding be made an option? If so, users could choose whichever approach is best for them.
This PR is obsolete in its current state and is not being pursued.
@zptan If you'd like to pick it up and complete it, that would be great.
I just found this commit
- https://github.com/Jereq/load-gltf/commit/bc70241396569d9fba93c706055ec6cb4bbb9c5c
So it seems that we can already parse without copying + padding the original JSON:

```cpp
std::string json_no_padding = "{}";
simdjson::ondemand::parser parser;
simdjson::ondemand::document doc = parser.iterate(json_no_padding, json_no_padding.size() + simdjson::SIMDJSON_PADDING);
```
> So it seems that we already can parse without copying + padding the original json
You can definitely do as you describe, but if the string is located near the end of a page and the next page is not allocated, the resulting code could crash.