zstd
zstd copied to clipboard
Vectorize `blockSize_explicitDelimiter()`
The function blockSize_explicitDelimiter() is used to validate that sum(matchLen) + sum(litLen) for external sequences adds up to the correct blockSize. The hot loop is:
while (spos < inSeqsSize) {
end = (inSeqs[spos].offset == 0);
blockSize += inSeqs[spos].litLength + inSeqs[spos].matchLength;
if (end) {
if (inSeqs[spos].matchLength != 0)
return 42; // error
break;
}
spos++;
}
Godbolt indicates that clang15 compiles this to a naive scalar loop: https://godbolt.org/z/osjE5KeP6. gcc12 yields a similar result. uiCA gives the naive loop a throughput of 3 cycles/sequence, which seems pretty slow to me: https://bit.ly/401i7ZD.
I think we can do significantly better. For example, this loop could be vectorized by loading one ZSTD_Sequence struct into a SSE vector (or two ZSTD_Sequence structs into an AVX vector) and performing 2 (or 4) parallel sums. Skylake Xeon can do 2 (load+VPADD)/cycle (according to uops.info), so with multiple accumulators and unrolling we could get up to 4 (or 8) parallel sums. Of course, we'd need to be careful about 32-bit overflow, but even with vectorized overflow checks this solution should be much faster.