Optimize ZSTD_decodeSequence when ofBits==0
This patch adds a branch to previously branchless code in the decompression hot loop to handle the case where ofBits == 0.
Although this adds a branch, it avoids instructions that introduce a memory dependency and unneeded memory operations when the condition isn't met.
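To illustrate the idea described above, here is a minimal sketch of a repeat-offset update with a dedicated ofBits == 0 path. It is not the actual patch: `RepHistory`, `resolve_general`, `resolve_ofbits0`, and `resolve` are hypothetical names, and the model assumes a three-entry offset history where the repcode index is `ll0` when ofBits == 0 and `1 + ll0 + extraBit` when ofBits == 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the three-entry repeat-offset history. */
typedef struct { size_t rep[3]; } RepHistory;

/* General path: handles every repcode index 0..3.
 * It needs a data-dependent load (rep[idx]) and up to three stores. */
static size_t resolve_general(RepHistory* h, unsigned idx)
{
    size_t temp;
    assert(idx <= 3);
    if (idx == 0) return h->rep[0];            /* most recent offset, history unchanged */
    temp = (idx == 3) ? h->rep[0] - 1 : h->rep[idx];
    temp += !temp;                             /* 0 is not a valid offset; force it to 1 */
    if (idx != 1) h->rep[2] = h->rep[1];
    h->rep[1] = h->rep[0];
    h->rep[0] = temp;
    return temp;
}

/* Dedicated path for ofBits == 0: the index can only be ll0 (0 or 1),
 * so rep[2] is never touched and no extra bit is read. */
static size_t resolve_ofbits0(RepHistory* h, unsigned ll0)
{
    size_t const offset = h->rep[ll0];
    h->rep[1] = h->rep[!ll0];                  /* no-op when ll0 == 0, swap when ll0 == 1 */
    h->rep[0] = offset;
    return offset;
}

/* The added branch: take the short path when ofBits == 0.
 * Assumption: this sketch only models ofBits <= 1 (the repcode cases). */
static size_t resolve(RepHistory* h, unsigned ofBits, unsigned ll0, unsigned extraBit)
{
    if (ofBits == 0) return resolve_ofbits0(h, ll0);
    return resolve_general(h, 1 + ll0 + extraBit);
}
```

In this model both paths produce the same result for indices 0 and 1; the short path simply does less work, which is the trade-off the branch is meant to buy.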
Testing on Intel Skylake shows decompression speed improvements of 1% to 7% across different corpora and compilers. On an M1 MacBook Pro, performance is mostly neutral, with a possible very small regression.
Full benchmark results: https://docs.google.com/spreadsheets/d/1hEUY5Gkf6Ebz6Gq5X9U5mURC_SsI43BhIFpE7uBDVsw/edit?usp=sharing
Seems reasonable for most data, since we probably almost never use ll0 repcodes. I wonder what the perf looks like when we do. E.g. maybe kennedy.xls has this pattern.
I will run benchmarks on my server as well