zstd icon indicating copy to clipboard operation
zstd copied to clipboard

Vectorize `blockSize_explicitDelimiter()`

Open embg opened this issue 2 years ago • 0 comments

The function blockSize_explicitDelimiter() is used to validate that sum(matchLen) + sum(litLen) for external sequences adds up to the correct blockSize. The hot loop is:

    while (spos < inSeqsSize) {
        end = (inSeqs[spos].offset == 0);
        blockSize += inSeqs[spos].litLength + inSeqs[spos].matchLength;
        if (end) {
            if (inSeqs[spos].matchLength != 0)
                return 42; // error
            break;
        }
        spos++;
    }

Godbolt indicates that clang15 compiles this to a naive scalar loop: https://godbolt.org/z/osjE5KeP6. gcc12 yields a similar result. uiCA gives the naive loop a throughput of 3 cycles/sequence, which seems pretty slow to me: https://bit.ly/401i7ZD.

I think we can do significantly better. For example, this loop could be vectorized by loading one ZSTD_Sequence struct into a SSE vector (or two ZSTD_Sequence structs into an AVX vector) and performing 2 (or 4) parallel sums. Skylake Xeon can do 2 (load+VPADD)/cycle (according to uops.info), so with multiple accumulators and unrolling we could get up to 4 (or 8) parallel sums. Of course, we'd need to be careful about 32-bit overflow, but even with vectorized overflow checks this solution should be much faster.

embg avatar Jan 25 '23 22:01 embg