mbedtls
Validate UTF-8 in DNs
Description
fix #7927
requires #8025 to be merged first
PR checklist
Please tick as appropriate and edit the reasons (e.g.: "backport: not needed because this is a new feature")
- [x] changelog provided
- [x] backport not required
- [x] tests provided
Notes for the submitter
Please refer to the contributing guidelines, especially the checklist for PR contributors.
Any updates on this?
I'm still skeptical that validating UTF-8 encodings is worth the effort. The benefits are quite limited. For a library that's intended to work on very small systems, the code size increase is non-negligible.
I should explain some background to this. Previously, we clobbered all non-ASCII characters, replacing them with `?` in output and refusing to accept them as input, which caused problems for certificates containing UTF-8 (see #3865, #3413).
After the merge of #8025 we accept escaped hex pairs (\AB, \6F, etc.) and also output non-ASCII bytes in this form. This at least lets applications work around our lack of support with their own escaping code, so that they can use certificates with (e.g.) non-English names.
Adding UTF-8 validation will allow us to accept plain, ordinary UTF-8 strings and drop the strict ASCII requirement (i.e. special characters will no longer need to be escaped into ASCII). This will simplify things a lot for ordinary users, who currently have to write code like the following:
size_t number_of_non_ascii_chars = count_non_ascii_chars(input_dn);
// TODO check for overflow
// Each non-ASCII byte becomes "\XX" (net +2 chars); +1 for the NUL terminator.
size_t escaped_len = input_dn_len + (number_of_non_ascii_chars * 2) + 1;
char *escaped = (char *) malloc(escaped_len);
if (escaped == NULL) {
    // handle allocation failure
}
escape_all_non_ascii_chars(escaped, escaped_len, input_dn, input_dn_len);
mbedtls_x509write_crt_set_subject_name(ctx, escaped);
free(escaped);
Re code size, this feature seems to add around 250 bytes on x86 (not sure of the numbers on Arm, I'll try to get some if I have time). This seems okay given that we haven't put much active effort into measuring / optimising the x509 code as a whole anyway, and it seems likely that the size can be reduced.
For the full context as well as a list of previous user requests about this, see #6785.
(Sorry, slipped and hit the 'close' button by accident)
There are three parts to UTF-8 support:
1. Allowing non-ASCII characters.
2. Rejecting invalid UTF-8 sequences.
3. Rejecting non-canonical Unicode code point sequences.
We don't need (2) or (3) to do (1). Where I'm skeptical is whether we should do (2).
Rejecting invalid UTF-8 is mostly functional compliance. The security concerns are indirect: we might accept invalid UTF-8 which will then cause a problem in some other code. The risk is not that bad and I'm not convinced it's worth the code size for us.
I'm more worried about (3), where the cost is bigger (Unicode canonicalization is not something you want on a microcontroller), but the risk is also much bigger: non-canonical names is a way to usurp a name.
That makes sense: if we validate UTF-8 encoding, users may assume that we also validate Unicode normalization and avoid doing it themselves. In that case I agree that the best we can do is just accept any byte sequence and loudly document that no validation is performed.
In that case all that is needed is to remove the ASCII-only checks that we currently have.
Having said that, the standards don't require us to create certificates with normalized UTF-8, only to normalize before comparing names (which we already document we don't do). But the standards do require that anything we put in a UTF8String is valid UTF-8.
We could justifiably validate the encoding on certificate creation but not perform normalization during comparison. I suppose the main risk would be that users just assume we 'support Unicode' and then fall afoul of the lack of normalization.