[Question] Invalid input data handling?
I'm not sure how "invalid" input data should be processed. Could this be clarified?
A few samples using `base64_decode`: what's the expected output for these?
- "Zm9vYg=" decodes successfully though it's not properly terminated
- "Zm9vYg" decodes successfully though it's not properly terminated
- "Zm9vY" decodes successfully though it misses data
- "Zm9vYmF=Zm9v" decodes successfully though it contains too much data or corruption occurred.
This could be extended to the streaming API, but I thought I'd start with the simple block decode.
The compliance baseline for this library should be the relevant RFC 4648. I never liked Postel's law much ("be strict in what you send, and liberal in what you accept"). Being liberal just invites abuse. All your examples should throw an error when they violate the preconditions.
I'll check RFC 4648 and try to make a PR to update the test suite and library to handle invalid input properly.
Thanks! Let's have a library that can at least strictly validate. If at any later moment we want to relax the validation, we could create a flag for it.
(Personally, I think that any string that violates the base64 spec as given in the RFC, is not a valid base64 string, and should not be given the benefit of the doubt.)
Maybe one comment: it's quite common for base64 input to contain whitespace (newlines and such), and it would be helpful to have a flag telling the library to ignore it. This doesn't negate the fact that a bona fide base64 string with such characters is invalid; it just means it would be useful to have a flag that strips out those characters before processing. (See also #15.)
So, after reviewing RFC 4648, not much to add to what you're saying.
First step, report all violations. Second step, allow whitespace. See #33
For now, I added validation for the `base64_decode` function only.
The streaming implementation would require a final step (as can be seen in the validation for `base64_decode`). This requires adding a new function to the API, or asking integrators to do the same check as what's done for `base64_decode`.
I'm more in favor of adding a new API function. Could be something like:
```c
int
base64_stream_decode_final (struct base64_state *state)
{
	/* Mark the stream as finished: */
	state->eof = BASE64_EOF;

	/* A clean stream ends on a 4-character quantum boundary, leaving no
	 * characters buffered; anything left over means the input was
	 * truncated. Returns 1 on success, 0 on error. */
	if (state->bytes == 0) {
		return 1;
	}
	return 0;
}
```
Hm, I'd like to avoid changing the API if possible. Your suggested addition basically checks that there are no trailing characters in the decode buffer. I'm not sure I consider that something we must intercept and flag. My attitude would be "as long as we honour the EOF marker and output no further bytes, who cares about trailing bytes". As far as decoding is concerned, those bytes are neither here nor there, and it seems pedantic to point out to the user that his buffer length calculation is "optimistic".
What do you think? Maybe the right way is to be pedantic and technically correct.
Well, if I quote what you said in an earlier comment:

> The compliance baseline for this library should be the relevant RFC 4648. I never liked Postel's law much ("be strict in what you send, and liberal in what you accept"). Being liberal just invites abuse. All your examples should throw an error when they violate the preconditions.
I'd say that the library should throw an error even when there are trailing bytes remaining. That should certainly be the case for the non-streaming API. Since there's no API to get the number of unread bytes, I'd say the library should return an error in this case (it might be possible to compute this with the current version, but adding whitespace stripping will remove that possibility).
I did not understand this:

> As far as decoding is concerned, those bytes are neither here or there, and it seems pedantic to point out to the user that his buffer length calculation is "optimistic".
The only thing that will be pointed out to the user is that he's feeding too much (or not enough) data into the decoder.
The API change might be considered an enhancement only; backward compatibility is preserved. If the user wants stricter behavior, he can integrate the new function into his code.
The viewpoint changes depending on how you conceptualize the buffer. If you define it as being strictly a Base64-encoded string and nothing more, you get a different viewpoint than when you define it as a buffer that contains a properly-terminated Base64-encoded string starting at offset 0.
In the first case trailing bytes are an error, while in the second case trailing bytes after a valid string are simply irrelevant (which is what I meant with "neither here nor there": they don't matter).
I've been holding more or less the second view of the buffer, whereas you are closer to the first. I don't think my view contradicts what I said earlier about violating the preconditions, because the preconditions are defined for the Base64-encoded string itself and not the buffer it's embedded in. Of your four examples, I believe the first three should throw errors, whereas the fourth is probably OK (but no bytes should be output after seeing the EOF marker).
Concretely, I've assumed that it's not an error to pass an oversized buffer (of "optimistic" length :), as long as the Base64 string starting at offset 0 is valid.
Maybe a stricter check would be a good idea, though. I do still think that it's better to be strict than liberal in these cases.
OK, I see what you meant now. What bothers me with that viewpoint is that, unless I'm mistaken or missing something, it's only valid for "terminated" Base64-encoded strings, that is, strings whose length is a multiple of 4 characters; otherwise whatever remains will still be decoded (or return an error at some point). Given that, I don't know what the real use cases for an "optimistic" length are. It seems more error-prone than anything else. Still, I might be missing something.