base64
base64 copied to clipboard
Whitespace character filtering
This is a new issue to discuss whitespace character filtering as mentioned in #15 and #27
The implementation will probably depend on density of whitespace characters.
- few occurrences, not grouped together (e.g. one new line every 80 characters): update error handling to filter those characters (easy, fast, tested on my end).
- few occurrences of group of whitespace (e.g. one new line + n spaces every 80 characters): update error handling to filter those characters by skipping multiple characters at a time (not tested at all on my end).
- many occurrences: a preprocess step might be more suited. This is still quite slow (1st implementation was based on code snippet from #15, changed that with a LUT for the shuffle mask): SSE4.2 implementation
while (srclen >= 32U) {
// Characters to be checked (all should be valid, repeat the first ones):
const __m128i whitespaces = _mm_setr_epi8(
'\n','\r','\t', ' ',
'\n','\r','\t', ' ',
'\n','\r','\t', ' ',
'\n','\r','\t', ' '
);
static const uint8_t lut[][16] __attribute__((aligned (16))) = {
{ 0U, 1U, 2U, 3U, 4U, 5U, 6U, 7U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U },
{ 1U, 2U, 3U, 4U, 5U, 6U, 7U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U },
...,
{ 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U }
};
static const uint8_t ipopcnt_lut[] = {
8U, 7U, 7U, 6U, 7U, 6U, 6U, 5U, 7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U,
7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U, 3U, 2U, 2U, 1U, 2U, 1U, 1U, 0U
};
__m128i c0 = _mm_loadu_si128((const __m128i*)(src + 0));
__m128i c2 = _mm_loadu_si128((const __m128i*)(src + 16));
src += 32;
srclen -= 32U;
__m128i wm0 = _mm_cmpistrm(whitespaces, c0, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
__m128i wm2 = _mm_cmpistrm(whitespaces, c2, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
unsigned int mi0 = (unsigned int)_mm_cvtsi128_si32(wm0);
unsigned int mi2 = (unsigned int)_mm_cvtsi128_si32(wm2);
if (mi0 | mi2) {
unsigned int i0 = ipopcnt_lut[mi0 & 255U];
unsigned int i1 = ipopcnt_lut[mi0 >> 8];
unsigned int i2 = ipopcnt_lut[mi2 & 255U];
unsigned int i3 = ipopcnt_lut[mi2 >> 8];
__m128i c1 = _mm_srli_si128(c0, 8);
__m128i c3 = _mm_srli_si128(c2, 8);
c0 = _mm_shuffle_epi8(c0, _mm_load_si128(lut[mi0 & 255U]));
c1 = _mm_shuffle_epi8(c1, _mm_load_si128(lut[mi0 >> 8]));
c2 = _mm_shuffle_epi8(c2, _mm_load_si128(lut[mi2 & 255U]));
c3 = _mm_shuffle_epi8(c3, _mm_load_si128(lut[mi2 >> 8]));
_mm_storel_epi64((__m128i*)dst, c0);
dst += i0;
_mm_storel_epi64((__m128i*)dst, c1);
dst += i1;
_mm_storel_epi64((__m128i*)dst, c2);
dst += i2;
_mm_storel_epi64((__m128i*)dst, c3);
dst += i3;
}
else {
_mm_storeu_si128((__m128i*)(dst + 0), c0);
_mm_storeu_si128((__m128i*)(dst + 16), c2);
dst += 32;
}
}
Here are the timings. All throughputs are given for the output buffer which is the same for all variants.
decode: valid base64 input
decode-s: one whitespace every 80 characters (we can see the choice of 80 has an impact on AVX2 decoder for method 1 because it's a multiple of 16 but not 32)
decode-s8: 8 whitespace every 80 characters
decode-d: 1 whitespace before each valid character.
Method 1 seems to be the best for sparse whitespace characters except for AVX2 decoder (that could/should be fixed to handle 8/16 bytes valid input). For handling sparse group of characters or very dense whitespace characters, Method 3 is better.
Real world data analysis would be good to know what case we want to optimize.
For Method 1:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2 decode 5279.50 MB/sec
AVX2 decode-s 1587.19 MB/sec
AVX2 decode-s8 1005.11 MB/sec
AVX2 decode-d 226.64 MB/sec
plain decode 1234.00 MB/sec
plain decode-s 1180.52 MB/sec
plain decode-s8 899.77 MB/sec
plain decode-d 245.90 MB/sec
SSSE3 decode 2941.00 MB/sec
SSSE3 decode-s 2414.34 MB/sec
SSSE3 decode-s8 1198.43 MB/sec
SSSE3 decode-d 210.47 MB/sec
SSE41 decode 2938.98 MB/sec
SSE41 decode-s 2369.73 MB/sec
SSE41 decode-s8 1167.77 MB/sec
SSE41 decode-d 209.58 MB/sec
SSE42 decode 3820.97 MB/sec
SSE42 decode-s 3469.99 MB/sec
SSE42 decode-s8 2009.10 MB/sec
SSE42 decode-d 273.74 MB/sec
AVX decode 3959.44 MB/sec
AVX decode-s 3573.91 MB/sec
AVX decode-s8 2051.51 MB/sec
AVX decode-d 233.33 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2 decode 5302.67 MB/sec
AVX2 decode-s 1590.81 MB/sec
AVX2 decode-s8 1004.56 MB/sec
AVX2 decode-d 226.81 MB/sec
plain decode 1238.86 MB/sec
plain decode-s 1184.43 MB/sec
plain decode-s8 903.14 MB/sec
plain decode-d 241.19 MB/sec
SSSE3 decode 2952.08 MB/sec
SSSE3 decode-s 2472.69 MB/sec
SSSE3 decode-s8 1157.22 MB/sec
SSSE3 decode-d 211.32 MB/sec
SSE41 decode 2952.81 MB/sec
SSE41 decode-s 2477.51 MB/sec
SSE41 decode-s8 1201.25 MB/sec
SSE41 decode-d 210.74 MB/sec
SSE42 decode 3831.55 MB/sec
SSE42 decode-s 3487.39 MB/sec
SSE42 decode-s8 2101.31 MB/sec
SSE42 decode-d 275.14 MB/sec
AVX decode 3967.55 MB/sec
AVX decode-s 3507.18 MB/sec
AVX decode-s8 2055.18 MB/sec
AVX decode-d 225.99 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2 decode 5272.91 MB/sec
AVX2 decode-s 1592.12 MB/sec
AVX2 decode-s8 998.94 MB/sec
AVX2 decode-d 226.41 MB/sec
plain decode 1237.12 MB/sec
plain decode-s 1183.57 MB/sec
plain decode-s8 902.16 MB/sec
plain decode-d 241.08 MB/sec
SSSE3 decode 2920.62 MB/sec
SSSE3 decode-s 2401.63 MB/sec
SSSE3 decode-s8 1165.97 MB/sec
SSSE3 decode-d 213.04 MB/sec
SSE41 decode 2952.13 MB/sec
SSE41 decode-s 2473.57 MB/sec
SSE41 decode-s8 1170.04 MB/sec
SSE41 decode-d 210.97 MB/sec
SSE42 decode 3826.16 MB/sec
SSE42 decode-s 3484.39 MB/sec
SSE42 decode-s8 1877.99 MB/sec
SSE42 decode-d 274.92 MB/sec
AVX decode 3860.69 MB/sec
AVX decode-s 3491.69 MB/sec
AVX decode-s8 2030.65 MB/sec
AVX decode-d 222.85 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2 decode 5213.04 MB/sec
AVX2 decode-s 1608.52 MB/sec
AVX2 decode-s8 1012.05 MB/sec
AVX2 decode-d 231.93 MB/sec
plain decode 1237.74 MB/sec
plain decode-s 1184.74 MB/sec
plain decode-s8 903.57 MB/sec
plain decode-d 250.86 MB/sec
SSSE3 decode 2942.19 MB/sec
SSSE3 decode-s 2464.90 MB/sec
SSSE3 decode-s8 1197.61 MB/sec
SSSE3 decode-d 219.08 MB/sec
SSE41 decode 2946.40 MB/sec
SSE41 decode-s 2468.99 MB/sec
SSE41 decode-s8 1165.37 MB/sec
SSE41 decode-d 214.67 MB/sec
SSE42 decode 3703.31 MB/sec
SSE42 decode-s 3372.53 MB/sec
SSE42 decode-s8 1814.50 MB/sec
SSE42 decode-d 267.22 MB/sec
AVX decode 3843.75 MB/sec
AVX decode-s 3460.58 MB/sec
AVX decode-s8 2030.02 MB/sec
AVX decode-d 227.47 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2 decode 4314.66 MB/sec
AVX2 decode-s 1562.26 MB/sec
AVX2 decode-s8 1024.11 MB/sec
AVX2 decode-d 233.38 MB/sec
plain decode 1207.16 MB/sec
plain decode-s 1129.71 MB/sec
plain decode-s8 904.50 MB/sec
plain decode-d 249.83 MB/sec
SSSE3 decode 2714.32 MB/sec
SSSE3 decode-s 2369.40 MB/sec
SSSE3 decode-s8 1203.05 MB/sec
SSSE3 decode-d 219.10 MB/sec
SSE41 decode 2716.16 MB/sec
SSE41 decode-s 2392.42 MB/sec
SSE41 decode-s8 1203.59 MB/sec
SSE41 decode-d 215.04 MB/sec
SSE42 decode 3418.99 MB/sec
SSE42 decode-s 3181.47 MB/sec
SSE42 decode-s8 1904.47 MB/sec
SSE42 decode-d 283.45 MB/sec
AVX decode 3556.95 MB/sec
AVX decode-s 3238.04 MB/sec
AVX decode-s8 1986.58 MB/sec
AVX decode-d 233.68 MB/sec
For Method 3:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2 decode 3230.06 MB/sec
AVX2 decode-s 2741.82 MB/sec
AVX2 decode-s8 2874.97 MB/sec
AVX2 decode-d 1558.49 MB/sec
plain decode 619.40 MB/sec
plain decode-s 548.73 MB/sec
plain decode-s8 548.71 MB/sec
plain decode-d 420.43 MB/sec
SSSE3 decode 2057.59 MB/sec
SSSE3 decode-s 1807.58 MB/sec
SSSE3 decode-s8 1920.25 MB/sec
SSSE3 decode-d 1257.10 MB/sec
SSE41 decode 2123.90 MB/sec
SSE41 decode-s 1890.92 MB/sec
SSE41 decode-s8 1883.30 MB/sec
SSE41 decode-d 1209.47 MB/sec
SSE42 decode 2622.27 MB/sec
SSE42 decode-s 2366.88 MB/sec
SSE42 decode-s8 2421.09 MB/sec
SSE42 decode-d 1435.30 MB/sec
AVX decode 2777.55 MB/sec
AVX decode-s 2419.70 MB/sec
AVX decode-s8 2480.22 MB/sec
AVX decode-d 1450.81 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2 decode 3378.80 MB/sec
AVX2 decode-s 2775.84 MB/sec
AVX2 decode-s8 2892.33 MB/sec
AVX2 decode-d 1590.29 MB/sec
plain decode 626.00 MB/sec
plain decode-s 580.68 MB/sec
plain decode-s8 562.96 MB/sec
plain decode-d 408.17 MB/sec
SSSE3 decode 2191.44 MB/sec
SSSE3 decode-s 1892.64 MB/sec
SSSE3 decode-s8 1861.03 MB/sec
SSSE3 decode-d 1227.95 MB/sec
SSE41 decode 2140.16 MB/sec
SSE41 decode-s 1893.56 MB/sec
SSE41 decode-s8 1911.62 MB/sec
SSE41 decode-d 1226.44 MB/sec
SSE42 decode 2724.52 MB/sec
SSE42 decode-s 2257.47 MB/sec
SSE42 decode-s8 2370.60 MB/sec
SSE42 decode-d 1424.35 MB/sec
AVX decode 2743.86 MB/sec
AVX decode-s 2433.86 MB/sec
AVX decode-s8 2428.08 MB/sec
AVX decode-d 1408.13 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2 decode 3284.48 MB/sec
AVX2 decode-s 2735.35 MB/sec
AVX2 decode-s8 2813.35 MB/sec
AVX2 decode-d 1574.50 MB/sec
plain decode 623.97 MB/sec
plain decode-s 570.76 MB/sec
plain decode-s8 566.80 MB/sec
plain decode-d 406.15 MB/sec
SSSE3 decode 2191.73 MB/sec
SSSE3 decode-s 1905.69 MB/sec
SSSE3 decode-s8 1843.39 MB/sec
SSSE3 decode-d 1261.96 MB/sec
SSE41 decode 2190.32 MB/sec
SSE41 decode-s 1938.25 MB/sec
SSE41 decode-s8 1841.31 MB/sec
SSE41 decode-d 1259.89 MB/sec
SSE42 decode 2738.85 MB/sec
SSE42 decode-s 2408.05 MB/sec
SSE42 decode-s8 2430.57 MB/sec
SSE42 decode-d 1437.67 MB/sec
AVX decode 2818.79 MB/sec
AVX decode-s 2388.06 MB/sec
AVX decode-s8 2494.78 MB/sec
AVX decode-d 1455.20 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2 decode 3345.58 MB/sec
AVX2 decode-s 2842.44 MB/sec
AVX2 decode-s8 2887.61 MB/sec
AVX2 decode-d 1561.51 MB/sec
plain decode 634.51 MB/sec
plain decode-s 591.72 MB/sec
plain decode-s8 573.85 MB/sec
plain decode-d 464.49 MB/sec
SSSE3 decode 2182.45 MB/sec
SSSE3 decode-s 1927.53 MB/sec
SSSE3 decode-s8 1933.26 MB/sec
SSSE3 decode-d 1248.62 MB/sec
SSE41 decode 2183.39 MB/sec
SSE41 decode-s 1954.86 MB/sec
SSE41 decode-s8 1935.52 MB/sec
SSE41 decode-d 1247.71 MB/sec
SSE42 decode 2726.89 MB/sec
SSE42 decode-s 2421.89 MB/sec
SSE42 decode-s8 2432.85 MB/sec
SSE42 decode-d 1421.62 MB/sec
AVX decode 2809.99 MB/sec
AVX decode-s 2470.29 MB/sec
AVX decode-s8 2496.32 MB/sec
AVX decode-d 1432.62 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2 decode 2964.24 MB/sec
AVX2 decode-s 2619.61 MB/sec
AVX2 decode-s8 2530.74 MB/sec
AVX2 decode-d 1536.01 MB/sec
plain decode 621.43 MB/sec
plain decode-s 588.34 MB/sec
plain decode-s8 567.04 MB/sec
plain decode-d 459.67 MB/sec
SSSE3 decode 2011.50 MB/sec
SSSE3 decode-s 1848.21 MB/sec
SSSE3 decode-s8 1764.75 MB/sec
SSSE3 decode-d 1231.03 MB/sec
SSE41 decode 2001.69 MB/sec
SSE41 decode-s 1846.22 MB/sec
SSE41 decode-s8 1765.15 MB/sec
SSE41 decode-d 1229.00 MB/sec
SSE42 decode 2507.66 MB/sec
SSE42 decode-s 2254.18 MB/sec
SSE42 decode-s8 2162.37 MB/sec
SSE42 decode-d 1405.58 MB/sec
AVX decode 2582.07 MB/sec
AVX decode-s 2305.40 MB/sec
AVX decode-s8 2281.94 MB/sec
AVX decode-d 1414.54 MB/sec
ssse3 with smaller lut https://gist.github.com/aqrit/6e73ca6ff52f72a2b121d584745f89f3
Is there any interest in adopting this PR?
We are looking into adopting this library, and ran into an issue with some keys we expect to decode have \n characters.
@izoroth Short answer: not currently, no.
Long answer: Well, it's not really a PR, it's a series of sketches of what a despacing algorithm might look like. There's no definite patchset to merge, and if there was, the patches would probably not apply on top of the current master branch. There would also be API changes to consider. And there is a discussion to be had about whether to optimize for sparse whitespace or dense whitespace.
I think that whitespace filtering could make sense in this library as an opt-in feature, but it's not fleshed out enough yet.