Decoding from uint16_t[]
First off, thank you for providing this open source library (and under the BSD license).
My question is: would it be possible to see versions of the decoding functions that support uint16_t[] strings in addition to const char[] strings? I tried looking at just the ssse3 implementation (since that is what is used on my dev machine) to see how that might be implemented, but the numbers in the comments have me confused.
My particular use case is that I get a uint16_t[] from the V8 JS engine, which is how it internally stores its strings. Being able to use that value type directly would be more efficient than having to make a copy to an 8-bit array first.
I'd like to get uint16_t[] supported for each codec, since we run on a variety of platforms. Any help you can give would be much appreciated.
I'm a bit unclear on what exactly you want to do, and with which data type. Do you want to base64-decode a uint16_t array which you get from V8? In that case, couldn't you just typecast the array as a series of 8-bit bytes to circumvent the type system? Something like this:
// `src` is an array of uint16_t.
// `srclen` is the length of `src` counted in 16-bit characters.
int ret = base64_decode((char *)src, srclen * 2, out, &outlen, 0);
We cast src to type char *, and double the length count, because srclen counts 16-bit characters, each of which contains two octets.
I have a feeling that's not what you mean though. Is it something more like this? V8 is giving you a base64-encoded string where each 16-bit character only has the lower byte set. Your plain ASCII strings (including your encoded base64 strings) look like this:
0x0048 | 0x0045 | 0x004C | 0x004C | 0x004F
H E L L O
So in order to do the base64 decoding on plain, unpadded 8-bit input, you would need to preprocess the input to squeeze out every second byte. Conversely, to encode a plain 8-bit bytestream, you would have to "upsample" the output from 8-bit to this 16-bit representation by adding padding bytes. Is that what you mean?
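For illustration, plain scalar versions of those two steps could look something like this (untested; the function names are made up and not part of the library):
#include <stddef.h>
#include <stdint.h>

// Squeeze 16-bit characters down to 8-bit bytes before decoding.
// Returns 0 on success, or -1 if any high byte is set (invalid input).
static int
narrow_utf16 (const uint16_t *src, size_t srclen, uint8_t *dst)
{
    for (size_t i = 0; i < srclen; i++) {
        if (src[i] > 0xFF)
            return -1;
        dst[i] = (uint8_t) src[i];
    }
    return 0;
}

// "Upsample" encoded 8-bit output back to 16-bit characters by adding
// zero padding bytes.
static void
widen_to_utf16 (const uint8_t *src, size_t srclen, uint16_t *dst)
{
    for (size_t i = 0; i < srclen; i++)
        dst[i] = src[i];
}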
@aklomp FYI, V8 uses UTF-16 for representing strings and abuses uint16_t as its raw UTF-16 code unit type. So while the high byte is always zero for valid base64 input, it isn't sufficient to "squeeze out every second byte", because invalid input (i.e. non-base64 characters) might be silently converted to valid input, which could be dangerous in certain contexts.
Yes, what @BurningEnlightenment said. The example of simply casting to a char array won't work because of the leading zero bytes causing early termination.
Now that I've thought about it some more, it makes more sense to keep the length the same and just add a check on the upper byte and break early if it's non-zero.
So the proposed solution would be to duplicate the API and the underlying logic to add first-class support for these wide chars (as I'll call them).
I'm reluctant to do this, because of separation of concerns. In my view, this library does one simple thing, but does it in the best possible way, and that is encoding and decoding of flat base64 strings. I see any enlargement of that scope as being principally undesirable.
So I'm hesitant to make the big changes that would be necessary to support wide chars in the streaming API. I'd need to duplicate the API, add flags into all the codecs, and probably duplicate code. I'd really rather not, though it might be interesting to set up a development branch to test how feasible all this is.
The good news is that for the case where you don't need streaming conversions (i.e. you have a fixed buffer of known length), you can get along with implementing simple input and output wrappers around the non-streaming API functions (base64_encode() and base64_decode()). The input/output filters are a good example of separating the concerns into a module of their own.
The input filter for decoding would (1) check if any of the high bytes are set, and (2) squeeze out every second byte. It would then feed the modified buffer to base64_decode() (the non-streaming function). Both (1) and (2) can be done super efficiently with SIMD instructions. ((1) could be done by or'ing all input bytes together, then checking if any of the high bytes have bits set. If so, return error. (2) could be done with byte shuffling.)
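For instance, (1) could be sketched like this with SSE2 intrinsics (untested; the helper name is made up, and the input length is assumed to be a multiple of 8 characters for brevity):
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

// OR all input vectors together, then test whether any high byte of the
// 16-bit characters ever had a bit set.
static int
any_high_byte_set (const uint16_t *src, size_t nchars)
{
    __m128i acc = _mm_setzero_si128();

    // Assumes nchars is a multiple of 8; a real version would handle the tail.
    for (size_t i = 0; i < nchars; i += 8)
        acc = _mm_or_si128(acc, _mm_loadu_si128((const __m128i *) (src + i)));

    // Keep only the high byte of each little-endian 16-bit character...
    acc = _mm_and_si128(acc, _mm_set1_epi16((short) 0xFF00));

    // ...and report whether any of those bytes ended up nonzero.
    return _mm_movemask_epi8(_mm_cmpeq_epi8(acc, _mm_setzero_si128())) != 0xFFFF;
}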
The output filter for encoding would simply pad the output with zero bytes. This too can be done very efficiently with SIMD instructions. Of course, SIMD instructions work on many bytes at the same time, so you'd also need to handle individual bytes at the tail, and so on. But on the whole it's not infeasible.
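The padding step for one 16-byte block could look something like this (again untested and illustrative only; assumes SSE2 and a little-endian layout):
#include <emmintrin.h>
#include <stdint.h>

// Widen 16 encoded bytes at `src` into 16 little-endian 16-bit characters
// at `dst` by interleaving them with zero bytes.
static void
widen16_sse2 (const uint8_t *src, uint16_t *dst)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i in   = _mm_loadu_si128((const __m128i *) src);

    // Interleaving with zero puts the zero in the high byte of each
    // little-endian 16-bit character.
    _mm_storeu_si128((__m128i *) dst,       _mm_unpacklo_epi8(in, zero));
    _mm_storeu_si128((__m128i *) (dst + 8), _mm_unpackhi_epi8(in, zero));
}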
Also, if you're not interested in duplication of the whole API and what not, and if you're fine with using a development branch, you could "hack" the streaming codecs to do the byte shuffling on the fly. Was that the kind of adaptation you were alluding to when you said that you looked at the SSSE3 implementation but were confused by the numbers? I could perhaps help you there.
Oops, I guess I shouldn't think before coffee... the non-plain codec changes won't be that trivial. I really only need 16-bit decoding, so that makes it a little easier.
I'm still a little stumped as to what the non-plain codec changes would need to be (as far as adjusting the calculations and such), as I'm not very familiar with the intrinsics. After some brief research it does look like there are 16-bit intrinsics that could be used, but there are other (less obvious, to me) changes that I suspect would also need to be made, such as a 16-bit-compatible dec_reshuffle().
EDIT: I don't mind duplicating the decode functions locally, but figuring out the necessary (efficient) intrinsics changes is what I'd be struggling with.
EDIT 2: I should also note, I'm currently not using the streaming functionality, just base64_decode() for decoding.
If you're not using the streaming functions, why not use a two-stage pipeline and preprocess the input string using SSSE3 functions before sending it to the decoder? That should be roughly as fast as doing the conversion on the fly (we're doing the same amount of work), but wouldn't require any changes to the base64 library. You'd never have to touch the decoder code. You can do the conversion in-place, because it's guaranteed that the output will always be shorter than the input.
As for the preprocessor function, it would contain a loop that branches on a condition: if the remaining input size is >= 16 bytes, then use the SSSE3 code, else use the fallback code (which works byte-by-byte). The fallback code is pretty trivial. The SSSE3 code can look something like this (off-the-cuff pseudocode and not tested):
// `in` is an __m128i containing 8 16-bit characters.
// `o` is a uint8_t* pointing to the output buffer.
// Compare with zero.
// `iszero` will contain 0xFF where input byte was zero, 0x00 where it was not.
__m128i iszero = _mm_cmpeq_epi8(in, _mm_setzero_si128());
// Extract a bitmask from the iszero mask.
// Bits will be 1 if the corresponding byte in `in` was equal to zero:
int mask = _mm_movemask_epi8(iszero);
// Check if any of the high bytes were nonzero, return error if so:
// FIXME: this uses GCC-only binary literal syntax, just to give the idea:
if ((mask & 0b0101010101010101) != 0b0101010101010101)
    return -1;
// Squeeze out every high byte, so that only the low bytes remain:
// FIXME: not sure of the byte mask pattern.
__m128i shuf = _mm_shuffle_epi8(in, _mm_set_epi8(
-1, -1, -1, -1,
-1, -1, -1, -1,
0, 2, 4, 6,
8, 10, 12, 14));
// Write back 16 bytes, of which 8 bytes are output:
_mm_storeu_si128((__m128i *)o, shuf);
// Move output pointer 8 bytes:
o += 8;
Thanks, that got me started in the right direction. FWIW the byte mask pattern ended up needing to be (at least for little endian?):
-1, -1, -1, -1,
-1, -1, -1, -1,
14, 12, 10, 8,
6, 4, 2, 0
The high byte check isn't working, but I haven't really looked at that yet.
One other problem I have when decoding is that I need to ignore invalid characters instead of returning early on such characters. Currently I'm doing something like this inside dec/ssse3.c to ignore invalid characters:
int invalid = _mm_movemask_epi8(CMPEQ(delta, 0));
if (invalid) {
// Pack valid bytes together
uint8_t pos = 0;
#define CHECK_BYTE(n) \
if (!(invalid & 1<<n)) \
c[pos++] = c[n];
CHECK_BYTE(0);
CHECK_BYTE(1);
CHECK_BYTE(2);
CHECK_BYTE(3);
CHECK_BYTE(4);
CHECK_BYTE(5);
CHECK_BYTE(6);
CHECK_BYTE(7);
CHECK_BYTE(8);
CHECK_BYTE(9);
CHECK_BYTE(10);
CHECK_BYTE(11);
CHECK_BYTE(12);
CHECK_BYTE(13);
CHECK_BYTE(14);
CHECK_BYTE(15);
// Shift the remaining input forward to fill the gap created by the
// invalid bytes (`srclen` still includes the current 16-byte block):
memmove(c + pos, c + 16, srclen - 16);
// Retry the current position with the newly shifted-in data, with the
// source length adjusted appropriately:
srclen -= (16 - pos);
continue;
}
This seems to work, but do you know if there is a more efficient way to do this? The only other way I could think of would be to load the position changes into a new __m128i, shuffle the bytes, store the result back into the source, and then memmove() and continue. However, it seemed like that would be slower/take more instructions than just manually copying the bytes where they need to be?
The reason to ignore invalid bytes on decode is to skip newlines and carriage returns, right? I've been thinking about adding an option to the library for that actually. As you've found, it's not trivial in pure SSE to remove those bytes from the input stream and shift over the rest. But there must be a way, there's usually a crafty left-field approach if you think about it hard enough. I'll see if I can come up with something.
In the meantime, you could perhaps get away with a slight alteration to the current library. If the SSE code encounters an invalid input char, it currently falls back to the bytewise, per-character handling in lib/dec/tail.c. This rechecks the input character: it does a lookup of that character in base64_table_dec[], and if the result is 255, it returns an error. You could patch the code to not return an error but silently skip the character.
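The shape of that patch would be roughly this (not the actual lib/dec/tail.c code, just the general idea, assuming the decode table is exposed as an array of bytes):
#include <stddef.h>
#include <stdint.h>

// Assumed declaration of the library's decode table (255 = invalid).
extern const uint8_t base64_table_dec[];

static void
skip_invalid_example (const uint8_t *c, size_t srclen)
{
    while (srclen-- > 0) {
        const uint8_t q = base64_table_dec[*c++];
        if (q == 255) {
            // The original code would flag an error and stop here;
            // to silently ignore invalid characters, just continue:
            continue;
        }
        // ...normal handling of the 6-bit value `q` would go here...
    }
}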
This starts to get slow if you have a lot of invalid characters, since you're doing bytewise decoding on each one of them, but if there's occasionally a character here or there, it shouldn't matter too much.
The main purpose is to skip whitespace, but all other invalid characters should be ignored as well. The reason is that I need backwards compatibility with an existing base64 decoder.
Here's the scheme I've come up with so far. It's probably not optimal, and I didn't benchmark or even compile it.
Say we have these nine valid bytes of input. We pick all nine of them for the output:
in: A B C D E F G H I
shuffle: 0 1 2 3 4 5 6 7 8
-------------------------------------
out: A B C D E F G H I
Now we have the same thing but with two invalid characters. Each time we encounter an invalid character, we skip over its index in the shuffle mask:
in: A B C _ D E _ F G
shuffle: 0 1 2 4 5 7 8 -1 -1
-------------------------------------
out: A B C D E F G ? ?
Also, in this example the total valid length becomes two characters shorter. We must account for that when we calculate outlen and increment the output pointer.
This is probably not the fastest possible method, but we can construct the shuffle mask on-the-fly from the bitmask. In code, it might look something like this (untested):
int i = 0;
// Invalid character bitmask:
uint16_t mask = _mm_movemask_epi8(CMPEQ(delta, 0));
// Create temporary array to construct shuffle mask:
uint8_t shm[16] __attribute__((aligned(16)));
// Create shuffle mask:
for (int j = 0; j < 16; j++)
if (!(mask & (1U << j)))
shm[i++] = j;
// This is not strictly necessary:
while (i < 16)
shm[i++] = -1;
// Shuffle input string:
str = _mm_shuffle_epi8(str, _mm_load_si128((__m128i *)shm));
// Number of error characters encountered
// (the string length should be decremented by this amount):
int errchars = __builtin_popcount(mask);
I didn't test this code, so maybe the endianness is wrong or something. Just to give an idea.
EDIT: the first version of the loop was incorrect.
Regarding the uint16_t to char conversion as a preprocessing step, why not just use _mm_packus_epi16? There's no need to check for invalid characters in that pre-processing stage; the check will be done as usual. "Negative" values (where the highest bit is set) will be converted to 0, which is invalid, and values greater than 255 will be saturated to 255, which is also invalid. This should be much faster than the SSSE3 method proposed in the earlier comment.
The first stage should also be limited to consuming only as much data as fits in the L1 cache before passing it to the second stage, looping as necessary.
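Something like this, untested (the function name is made up; a driver loop would then call it on chunks small enough to stay in L1 before handing each chunk to the decoder):
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

// Narrow a block of 16-bit characters to bytes with saturating packs.
// Out-of-range characters become 0 or 255, both invalid base64 bytes,
// so the decoder's own validation still catches them.
static void
narrow_block_packus (const uint16_t *src, uint8_t *dst, size_t nchars)
{
    // 16 input characters (two 128-bit loads) per iteration.
    while (nchars >= 16) {
        const __m128i lo = _mm_loadu_si128((const __m128i *) src);
        const __m128i hi = _mm_loadu_si128((const __m128i *) (src + 8));
        _mm_storeu_si128((__m128i *) dst, _mm_packus_epi16(lo, hi));
        src    += 16;
        dst    += 16;
        nchars -= 16;
    }
    // Scalar tail: map any out-of-range character to an invalid byte.
    while (nchars-- > 0) {
        const uint16_t c = *src++;
        *dst++ = (c > 0xFF) ? 0 : (uint8_t) c;
    }
}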
Without taking into account the whitespace skipping (which would also apply to the char version), I made a small POC of what could be done to support uint16_t[] decoding.
I tested two things:
- Using 2 stages in a loop, as mentioned before (to keep data in the L1 cache)
- Adding a decode16 function for each arch (except neon32/neon64 at the moment)
What I like about the first one is that it requires less work and would allow other preprocessing stages to be implemented the same way. But... it's slower than the second one. What I like about the second one is that it's faster. But... it requires adding an extra codec function for each arch, and this approach might not work on some architectures (there is no generic uint32_t/uint64_t implementation here; I will need to test NEON, but even with that, we don't know what the future will bring).
The second implementation didn't require that much code duplication after all. Overall, it's 15% faster:
AVX2: +9%
AVX: +19%
plain: -7% (no uint64_t decode16)
SSSE3: +17%
SSE41: +17%
SSE42: +18%
Here are the results from benchmark (keeping only decoding; buffer size is the output buffer size):
2 stages implementation:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2 decode 7071.56 MB/sec
AVX2 decode16 3873.48 MB/sec
plain decode 1557.58 MB/sec
plain decode16 695.27 MB/sec
SSSE3 decode 3875.35 MB/sec
SSSE3 decode16 2712.32 MB/sec
SSE41 decode 3876.45 MB/sec
SSE41 decode16 2706.70 MB/sec
SSE42 decode 5097.40 MB/sec
SSE42 decode16 3228.69 MB/sec
AVX decode 5284.70 MB/sec
AVX decode16 3300.12 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2 decode 7054.40 MB/sec
AVX2 decode16 4806.09 MB/sec
plain decode 1562.27 MB/sec
plain decode16 699.32 MB/sec
SSSE3 decode 3887.29 MB/sec
SSSE3 decode16 3161.14 MB/sec
SSE41 decode 3887.03 MB/sec
SSE41 decode16 3115.06 MB/sec
SSE42 decode 5108.83 MB/sec
SSE42 decode16 3916.78 MB/sec
AVX decode 5297.71 MB/sec
AVX decode16 4019.34 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2 decode 7034.43 MB/sec
AVX2 decode16 5078.66 MB/sec
plain decode 1555.46 MB/sec
plain decode16 695.94 MB/sec
SSSE3 decode 3881.58 MB/sec
SSSE3 decode16 3242.03 MB/sec
SSE41 decode 3694.30 MB/sec
SSE41 decode16 3227.33 MB/sec
SSE42 decode 5103.54 MB/sec
SSE42 decode16 4040.98 MB/sec
AVX decode 5290.70 MB/sec
AVX decode16 4137.45 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2 decode 6919.64 MB/sec
AVX2 decode16 5463.62 MB/sec
plain decode 1489.76 MB/sec
plain decode16 708.64 MB/sec
SSSE3 decode 3902.22 MB/sec
SSSE3 decode16 3418.27 MB/sec
SSE41 decode 3892.04 MB/sec
SSE41 decode16 3413.28 MB/sec
SSE42 decode 4933.65 MB/sec
SSE42 decode16 4176.20 MB/sec
AVX decode 5262.62 MB/sec
AVX decode16 4389.26 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2 decode 5658.08 MB/sec
AVX2 decode16 4749.03 MB/sec
plain decode 1469.45 MB/sec
plain decode16 693.57 MB/sec
SSSE3 decode 3593.95 MB/sec
SSSE3 decode16 3201.35 MB/sec
SSE41 decode 3575.33 MB/sec
SSE41 decode16 3176.95 MB/sec
SSE42 decode 4510.35 MB/sec
SSE42 decode16 3890.36 MB/sec
AVX decode 4657.02 MB/sec
AVX decode16 4075.72 MB/sec
Direct decoding:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2 decode 7054.01 MB/sec
AVX2 decode16 5387.99 MB/sec
plain decode 1476.16 MB/sec
plain decode16 649.31 MB/sec
SSSE3 decode 3873.82 MB/sec
SSSE3 decode16 3752.94 MB/sec
SSE41 decode 3888.66 MB/sec
SSE41 decode16 3753.32 MB/sec
SSE42 decode 5093.97 MB/sec
SSE42 decode16 4744.95 MB/sec
AVX decode 5278.66 MB/sec
AVX decode16 4872.50 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2 decode 7094.56 MB/sec
AVX2 decode16 4468.44 MB/sec
plain decode 1480.75 MB/sec
plain decode16 651.41 MB/sec
SSSE3 decode 3674.16 MB/sec
SSSE3 decode16 3771.94 MB/sec
SSE41 decode 3886.23 MB/sec
SSE41 decode16 3767.31 MB/sec
SSE42 decode 4794.23 MB/sec
SSE42 decode16 4772.17 MB/sec
AVX decode 5297.40 MB/sec
AVX decode16 4908.21 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2 decode 7028.21 MB/sec
AVX2 decode16 5949.40 MB/sec
plain decode 1479.39 MB/sec
plain decode16 651.18 MB/sec
SSSE3 decode 3885.90 MB/sec
SSSE3 decode16 3764.85 MB/sec
SSE41 decode 3886.18 MB/sec
SSE41 decode16 3764.73 MB/sec
SSE42 decode 5102.43 MB/sec
SSE42 decode16 4479.49 MB/sec
AVX decode 5291.07 MB/sec
AVX decode16 4892.08 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2 decode 6911.86 MB/sec
AVX2 decode16 5836.29 MB/sec
plain decode 1476.84 MB/sec
plain decode16 650.35 MB/sec
SSSE3 decode 3911.21 MB/sec
SSSE3 decode16 3749.75 MB/sec
SSE41 decode 3897.19 MB/sec
SSE41 decode16 3757.06 MB/sec
SSE42 decode 5071.97 MB/sec
SSE42 decode16 4717.31 MB/sec
AVX decode 5275.38 MB/sec
AVX decode16 4852.05 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2 decode 5657.92 MB/sec
AVX2 decode16 4551.91 MB/sec
plain decode 1438.35 MB/sec
plain decode16 645.72 MB/sec
SSSE3 decode 3585.06 MB/sec
SSSE3 decode16 3345.42 MB/sec
SSE41 decode 3570.98 MB/sec
SSE41 decode16 3349.45 MB/sec
SSE42 decode 4459.54 MB/sec
SSE42 decode16 4073.10 MB/sec
AVX decode 4641.85 MB/sec
AVX decode16 4183.66 MB/sec
I had a gut feeling that having a separate decode16 would be faster, but I never got around to verifying that.
Some questions:
- @mayeut I've started looking at your current POC, and while adding extra characters to the SSSE3 decoder implementation was straightforward (pretty much copy and paste), I'm a bit confused as to how I'd go about making such adjustments to the SSE4.2 decoder implementation, for example. I specifically need to add '-' and '_' as allowed characters. In the case of SSE4.2, is it just a matter of replacing the two 0 positions in lut and replacing two sets of the '+' entries in range?
- What do you think about my invalid character removal/ignore strategy? Does it seem optimal, or do you know of a better way (perhaps even avoiding memmove()/memcpy()), using (additional) intrinsics or not?
Yes, having a direct decode16 is faster, but is the speedup worth it compared to the 2-stage approach? Maybe not (and that's for @aklomp to decide, given he'll be the one supporting this).
- I'm not sure I understand what you're trying to achieve with those two additional characters. If you want them to be treated as valid, then yes, you can add those entries to range. For the lut part, it's a bit more complicated, but it should be possible given there's still room for 2 characters (yes, the 0 positions in lut). You will still need to update the indices logic for those new values.
- I don't think it will be any faster than handling this in dec_tail.c, at least for sparse occurrences of invalid characters.
@mayeut Thanks, I ended up using blend to (first) replace instances of _ with /, since _ has a much higher character value and doesn't really work with the initial indices computation.
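In case it helps anyone later, the blend step looks roughly like this (SSE4.1; not the exact code I'm using, and the function name is made up):
#include <smmintrin.h>

// Wherever an input byte equals '_', substitute '/' so the regular
// base64 index computation can handle it.
static __m128i
replace_underscore (__m128i str)
{
    const __m128i und = _mm_set1_epi8('_');
    const __m128i fsl = _mm_set1_epi8('/');
    return _mm_blendv_epi8(str, fsl, _mm_cmpeq_epi8(str, und));
}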
I also noticed codec->dec16 = .... was missing for all of the non-NEON codecs in your 16-bit decoder branch. After adding those and re-activating the codec->dec16 conditional in the main base64 decoding function, I saw a further performance boost similar to what you previously showed.
It would be nice to see the built-in 16-bit decoding merged in the main repo here, but I understand if you feel like it may not be common enough to warrant its inclusion.
At any rate, I want to thank you all for your input, guidance, and patience :-)
@mayeut while I think your proof-of-concept is remarkable (how are you so damn fast? :), I'm hesitant to merge something like it, for the reasons mentioned above. In my opinion, Base64 is an algorithm that works on octets and produces octets. Limited scope. uint16_t strings are not octets, so they must be preprocessed before running regular Base64. I see the preprocessing as a separate concern that the component that runs regular Base64 should not have to address.
(It's easier to argue this when the code and API changes involved in supporting uint16_t as a first-class datatype are quite extensive, as the POC demonstrates.)
On the other hand, I think that skipping whitespace and newlines is a good candidate for inclusion into this library, because invalid characters are a runtime property of the data instead of a compile-time property of the data's type. Skipping invalid characters is also a feature that has been talked about in the past.