
[WIP] fix #2966 part 2 : do not initialize tag space

Open Cyan4973 opened this issue 3 years ago • 18 comments

This diff disables tag initialization when using the rowHash mode. This initialization was unconditional, but it becomes the dominating operation when compressing small data in streaming mode (see issue #2966).

I could not find a good reason to initialize tags. Initialization merely makes tag values start at 0, but 0 is just a regular tag value, no more significant than any other. Worst case, an uninitialized tag gives a wrong hint of match presence, but even that should be filtered out by distance analysis, which remains active through index validation. So this won't impact the compression result.
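To make that reasoning concrete, here is a deliberately simplified, hypothetical sketch of candidate selection (names are illustrative; the real rowHash match loop is SIMD-based and structured differently): even when a garbage tag spuriously matches, the candidate's stored index still has to pass window validation before the match is considered, so the worst case is one wasted check.

```c
#include <stdint.h>

/* Hypothetical scalar sketch; names are illustrative, not the actual zstd code. */
static uint32_t findCandidate(const uint8_t* tags, const uint32_t* indices,
                              int rowEntries, uint8_t targetTag, uint32_t lowLimit)
{
    int i;
    for (i = 0; i < rowEntries; i++) {
        if (tags[i] != targetTag) continue;   /* tag filter: uninitialized tags may spuriously pass */
        if (indices[i] < lowLimit) continue;  /* index validation: out-of-window candidates rejected */
        return indices[i];                    /* plausible candidate; actual bytes are verified next */
    }
    return 0;  /* no candidate found in this row */
}
```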

Now, initially, I was suspicious that this would work, because the tag space is 2x larger than it should be, suggesting the additional space is used for something other than tag values, like determining the starting position in the row (which would be an overkill memory budget, but that's a different topic). But to my surprise, this change passes all tests successfully, suggesting rowHash is even resilient to a random start position. Edit: it seems to finally break on the msan + fuzzer test below, so that's something worth looking into.

The end result is significant. When combined with #2969, the compression speed of rowHash on small data increases dramatically, as can be seen below (#2969 is required, as otherwise the impact of tag initialization is just lost as part of a larger initialization issue).

The following measurement is taken on a Core i7-9700K (turbo disabled) with fullbench -b41, using geldings.txt as the sample (a small text file). The test corresponds to a scenario using ZSTD_compressStream() without the benefit of knowing the small sample size beforehand.

| level | v1.5.1 | #2969 + this PR | comment |
|------:|-------:|----------------:|---------|
| 1 | 101 MB/s | 113 MB/s | |
| 2 | 67 MB/s | 112 MB/s | |
| 3 | 31 MB/s | 94 MB/s | |
| 4 | 14 MB/s | 93 MB/s | |
| 5 | 9 MB/s | 54 MB/s | rowHash |
| 6 | 8.7 MB/s | 50 MB/s | rowHash |
| 7 | 4.6 MB/s | 50 MB/s | rowHash |
| 8 | 4.5 MB/s | 48 MB/s | rowHash |
| 9 | 1.7 MB/s | 48 MB/s | rowHash |
| 10 | 0.8 MB/s | 45 MB/s | rowHash |
| 11 | 0.8 MB/s | 39 MB/s | rowHash |
| 12 | 0.4 MB/s | 39 MB/s | rowHash |
| 13 | 0.6 MB/s | 38 MB/s | |
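To make the measured scenario concrete, the streaming use case above corresponds roughly to the sketch below (public streaming API; error handling omitted, and buffer sizes / input source are up to the caller):

```c
#include <zstd.h>

/* Compress a small buffer through the streaming API, without declaring its
 * size up front, so the cctx cannot tailor its tables to the (small) input. */
static size_t compressSmallSample(const void* src, size_t srcSize,
                                  void* dst, size_t dstCapacity, int level)
{
    ZSTD_CStream* const zcs = ZSTD_createCStream();
    ZSTD_inBuffer in = { src, srcSize, 0 };
    ZSTD_outBuffer out = { dst, dstCapacity, 0 };
    ZSTD_initCStream(zcs, level);         /* total size is not announced here */
    ZSTD_compressStream(zcs, &out, &in);  /* error checks omitted for brevity */
    ZSTD_endStream(zcs, &out);            /* flush the epilogue */
    ZSTD_freeCStream(zcs);
    return out.pos;                       /* number of compressed bytes produced */
}
```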

cc @terrelln : there might be some side effects which are not properly captured by the tests, such as potential reproducibility issues whose probability is low enough that they are too difficult to reproduce during CI tests. And maybe other side effects worth looking into.

Note : WIP, not for merge

Cyan4973 avatar Jan 04 '22 17:01 Cyan4973

msan tests fail because reading uninitialized tags is an automatic msan failure, even when the algorithm is designed to work correctly starting from uninitialized values. So I guess the next step is to mark the tag memory area as "fine" from an msan perspective.

Cyan4973 avatar Jan 04 '22 21:01 Cyan4973

@Cyan4973, see here for an example of explicitly unpoisoning memory.
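For the record, explicit unpoisoning boils down to something like the sketch below (guarded so it compiles away outside MSAN builds; the macro and function names are made up, and the actual helper used in the zstd code base may differ):

```c
#include <stddef.h>

#if defined(__has_feature)
#  if __has_feature(memory_sanitizer)
#    include <sanitizer/msan_interface.h>
#    define UNPOISON_MEMORY_REGION(addr, size) __msan_unpoison((addr), (size))
#  endif
#endif
#ifndef UNPOISON_MEMORY_REGION
#  define UNPOISON_MEMORY_REGION(addr, size) ((void)(addr), (void)(size))  /* no-op outside MSAN */
#endif

/* Tell MSAN that reading these (deliberately) uninitialized tag bytes is intentional. */
static void markTagSpaceAsReadable(void* tagTable, size_t tagTableSize)
{
    UNPOISON_MEMORY_REGION(tagTable, tagTableSize);
}
```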

felixhandte avatar Jan 04 '22 21:01 felixhandte

msan unpoisoning works for the msan tests, but there is still a remaining issue within the zlibwrapper tests. I suspect it's another msan case, one where zlibwrapper is compiled with msan but the linked libzstd is not, thus failing the msan test.

This, btw, is a good demonstration of what could happen if any other application linking libzstd attempts msan tests.

Cyan4973 avatar Jan 04 '22 23:01 Cyan4973

I suspect it's another case of msan test, but one where zlibwrapper is compiled with msan, but the linked libzstd is not, thus failing the msan test.

That is not supported by MSAN. I'm surprised it was working as-is. So we should fix that test to compile libzstd with MSAN.

MSAN needs all code to be compiled with MSAN in order to work correctly. That's unlike ASAN, which can work with linked code that isn't compiled with ASAN; it will just miss bugs in the non-ASAN code.

terrelln avatar Jan 04 '22 23:01 terrelln

Oh, that's a valgrind check, not MSAN.

terrelln avatar Jan 04 '22 23:01 terrelln

I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. When the tag table is in that range, we can do nothing. When it's outside that range, we memset the new tag table area and everything between it and the existing known-good range, and then record that expansion of the range. In the common case where the table ends up in the same position over and over, we don't have to do any work to reset the cctx, and even if things shift around, we only do incremental clearing as needed.
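A rough sketch of that bookkeeping (hypothetical names, independent of the actual cwksp code): keep a single "known-zeroed up to here" pointer, and only memset whatever the new tag table exposes beyond it.

```c
#include <string.h>

/* Hypothetical tracker: everything in the workspace below initializedEnd is
 * known to be zeroed. initializedEnd starts at the beginning of the region. */
typedef struct { char* initializedEnd; } InitTracker;

/* Ensure the tag table ending at tableEnd lies within the known-zeroed range.
 * Clears the new table area plus any gap below it, then records the expansion. */
static void ensureZeroedUpTo(InitTracker* t, char* tableEnd)
{
    if (tableEnd <= t->initializedEnd) return;  /* common case: table reused in place, no work */
    memset(t->initializedEnd, 0, (size_t)(tableEnd - t->initializedEnd));
    t->initializedEnd = tableEnd;               /* remember that the zeroed range has grown */
}
```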

felixhandte avatar Jan 05 '22 22:01 felixhandte

The big question is whether we also want to get this in for 1.5.2.

felixhandte avatar Jan 05 '22 22:01 felixhandte

I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. When the tag table is in that range, we can do nothing. When it's outside that range, we memset the new tag table area and everything between it and the existing known-good range, and then record that expansion of the range. In the common case where the table ends up in the same position over and over, we don't have to do any work to reset the cctx, and even if things shift around, we only do incremental clearing as needed.

This solution looks good to me.

Cyan4973 avatar Jan 05 '22 22:01 Cyan4973

The big question is whether we also want to get this in for 1.5.2.

Yes, we do, otherwise it's effectively a significant speed regression (for a specific scenario) compared to v1.4.9.

Cyan4973 avatar Jan 05 '22 23:01 Cyan4973

I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. When the tag table is in that range, we can do nothing. When it's outside that range, we memset the new tag table area and everything between it and the existing known-good range, and then record that expansion of the range. In the common case where the table ends up in the same position over and over, we don't have to do any work to reset the cctx, and even if things shift around, we only do incremental clearing as needed.

This solution looks good to me.

Although, it's not guaranteed that these movements of the tag space produce a single expanding memory region; they may end up defining multiple distant regions, in which case exact tracking becomes impractical, so we'll have to accept some inefficiency (i.e. initialization will happen more often than it would if tracking of initialized bytes were perfect).

I wonder now how competitive initialization with calloc() could be.

edit: on first look, it seems it could be made speed-equivalent. edit 2: well, not really; according to fullbench -b43, using calloc() to allocate and initialize the workspace makes the compression operation way slower when the workspace is continuously recreated and freed.
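For reference, a calloc-backed workspace can also be approximated from the application side through zstd's experimental custom-allocator hooks; a minimal sketch (wrapper names are hypothetical, and this is not how fullbench -b43 performs its measurement):

```c
#define ZSTD_STATIC_LINKING_ONLY  /* ZSTD_customMem lives in the experimental API */
#include <zstd.h>
#include <stdlib.h>

/* calloc-backed allocation: memory handed to the cctx arrives already zeroed */
static void* callocWrapper(void* opaque, size_t size) { (void)opaque; return calloc(1, size); }
static void  freeWrapper(void* opaque, void* address) { (void)opaque; free(address); }

/* Create a compression context whose workspace is zero-initialized by the allocator */
static ZSTD_CCtx* createZeroedCCtx(void)
{
    ZSTD_customMem const cmem = { callocWrapper, freeWrapper, NULL };
    return ZSTD_createCCtx_advanced(cmem);
}
```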

Cyan4973 avatar Jan 06 '22 02:01 Cyan4973

I think a sane approach would be to change the order in which the cwksp stores elements.

Instead of:

[                        ... workspace ...                         ]
[objects][tables ... ->] free space [<- ... aligned][<- ... buffers]

Make it:

[                        ... workspace ...                         ]
[objects][tables ... ->] free space [<- ... buffers][<- ... aligned uninitialized][<- ... aligned initialized]

Now we just need to keep track of where the boundary between the aligned-uninitialized and aligned-initialized sections is, and, if the initialized section grows, do that initialization.

terrelln avatar Jan 07 '22 00:01 terrelln

I have a branch that I modified to take that approach.

I do have a problem though... With context reuse, compression is slower when I don't memset the tag table. I think what's happening is that we're getting tags that match from the previous compression, but are outside our window. This causes extra work in cases where there otherwise wouldn't be any matches.

Basically, I think that this branch:

https://github.com/facebook/zstd/blob/26d88c05f089cc1f562ace949272c8919c93153f/lib/compress/zstd_lazy.c#L1208

rarely happens when the tables are sized to fit the input, since we filter out at the tag step instead. But when we remove the memset and are compressing similar data, we may find matches from the previous file, and the tag won't filter them out, so we hit this branch instead.

I'm going to look into it.

terrelln avatar Jan 07 '22 02:01 terrelln

Running zstd -b5e5 enwik7 -B128K, I see that the branch:

https://github.com/facebook/zstd/blob/26d88c05f089cc1f562ace949272c8919c93153f/lib/compress/zstd_lazy.c#L1208

is taken 50751 times when the tagTable is memset, and 7566147 times when it isn't. That's an increase of 149x.

terrelln avatar Jan 07 '22 02:01 terrelln

Testing zstd -b5e5 enwik7 -B128K on my desktop, I indeed notice a drop from 85 MB/s to 80 MB/s when not memsetting the tag area.

To be fair, I was expecting worse, given the ~150x increase in branch count.

When removing -B128K, there is no longer any difference. And on the other side, with -B16K, the difference is very small (<2%).

Which makes -B128K the odd worst case in the middle?

Cyan4973 avatar Jan 07 '22 16:01 Cyan4973

I think a sane approach would be to change the order that the cwksp stores elements. [...]

Yeah, I guess so. The reason it was laid out this way was so that, even if the buffers require a non-round number of bytes, all of the allocations could fit in a workspace with no extra padding bytes while respecting the alignment requirements of everything. I guess we've already backed away from this, though, by aligning the tables to 64 bytes. You will probably need to add 3 more padding bytes to the workspace budget, though.

felixhandte avatar Jan 07 '22 17:01 felixhandte

I do have a problem though... With context reuse compression is slower when I don't memset the tag table.

This PR started as a v1.5.2 performance fix for small data in streaming mode. If we now have to balance priorities with other scenarios that would be negatively impacted, that somewhat changes the scope.

Consequently, we may need more time to analyze trade-offs and consider alternatives, in which case this no longer seems like an item for v1.5.2.

Cyan4973 avatar Jan 07 '22 17:01 Cyan4973

I see that the branch: is taken 50751 times when the tagTable is memset, and 7566147 times when it isn't. That's an increase of 149x.

What about:

  1. creating a mask from the index table
  2. ANDing it with the tag mask
  3. now we have a single mask: each set position is guaranteed to be both within range and to match the tag
  4. the branch can be removed

I presume the issue is that creating a mask from the index table is a complex (& costly) operation.
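A scalar sketch of the idea (hypothetical names; the real row match loop computes the tag mask with SIMD): derive a validity bit per row entry from the stored indices, AND it into the tag-match mask, and every surviving bit is then both a tag match and in range, so the in-loop range branch disappears. The cost is indeed the extra pass over the indices.

```c
#include <stdint.h>

/* Build a combined candidate mask: bit i is set only if entry i both matches
 * the tag and stores an index within the valid window. Assumes rowEntries <= 64. */
static uint64_t candidateMask(const uint8_t* tags, const uint32_t* indices,
                              int rowEntries, uint8_t targetTag, uint32_t lowLimit)
{
    uint64_t mask = 0;
    int i;
    for (i = 0; i < rowEntries; i++) {
        uint64_t const tagHit  = (tags[i] == targetTag);    /* tag comparison (SIMD in practice) */
        uint64_t const inRange = (indices[i] >= lowLimit);   /* validity derived from the index table */
        mask |= (tagHit & inRange) << i;
    }
    return mask;  /* iterate over set bits: no per-candidate range branch needed anymore */
}
```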

Cyan4973 avatar Jan 08 '22 07:01 Cyan4973

I presume the issue is that creating a mask from the index table is a complex (& costly) operation.

It involves looping over the table. I think that this idea could work, and even reduce branches in all cases. It is definitely possible this could be a speed gain in general. But it remains to be tested.

terrelln avatar Jan 24 '22 22:01 terrelln

A more complete solution has been merged, see #3528.

yoniko avatar Mar 13 '23 23:03 yoniko