dav icon indicating copy to clipboard operation
dav copied to clipboard

Mangles vcards with diacritics

Open mbiebl opened this issue 2 years ago • 10 comments

I have a vCard like the following:

BEGIN:VCARD
VERSION:3.0
UID:F0-F3-D8-A4-EC-0A-34-F2-B2-AE-53-35-FB-F0-57-C5
N:Hatschek;ž;;;
FN:ž Hatschek
END:VCARD

The problem is in https://github.com/sabre-io/dav/blob/master/lib/DAV/StringUtil.php#L78

For some reason mb_detect_encoding() thinks the file is in ISO-8859-1 format and the re-encodes it as UTF-8. This results in ž being turned into ž

mbiebl avatar Jun 22 '22 13:06 mbiebl

should be fixed with - https://github.com/sabre-io/dav/pull/1404

@mbiebl can you test?

DeepDiver1975 avatar Jun 22 '22 15:06 DeepDiver1975

#1404 seems to fix this specific issue. I don't get a mangled/re-encoded vcard anymore.

mbiebl avatar Jun 22 '22 16:06 mbiebl

The result is weird though: mb_check_encoding($input, 'UTF-8') on the above file is true. mb_check_encoding($input, 'ISO-8859-1') on the above file reports true as well.

mbiebl avatar Jun 22 '22 17:06 mbiebl

I.e. the above data is not ISO-8859-1, or is it?

mbiebl avatar Jun 22 '22 17:06 mbiebl

@DeepDiver1975 I'm a bit sceptical about the results of mb_*_encoding and the automatic re-encoding.

How do you feel about making this re-enconding optional?

mbiebl avatar Jun 23 '22 17:06 mbiebl

I made a comment here about the ambiguities of UFT-8 and ISO-8859-1 https://github.com/sabre-io/http/pull/182#issuecomment-1165307289

I merged PR #1404 - IMO this provides the "best"/reasonable way to handle strings for which we do not know the encoding. If it parses as a valid UTF-8 string, then treat it is UFT-8, otherwise if it parses as ISO-8859-1 then convert it to UTF-8.

phil-davis avatar Jun 27 '22 08:06 phil-davis

Yeah, #1404 certainly improves the situation. That said, if you are in a situation where you can control the input data and know it's valid UTF-8, avoiding the detection-and-automatic-re-encoding would prove to be useful, I think.

mbiebl avatar Jun 27 '22 10:06 mbiebl

That said, if you are in a situation where you can control the input data and know it's valid UTF-8, avoiding the detection-and-automatic-re-encoding would prove to be useful, I think.

The new code checks !mb_check_encoding($input, 'UTF-8') && mb_check_encoding($input, 'ISO-8859-1')

If mb_check_encoding agrees that it looks like valid UTF*, then the logic evaluation if the if statement will stop there. It goes to the else and returns the input string without encoding/decoding it.

phil-davis avatar Jun 27 '22 10:06 phil-davis

Well, that assumes that mb_check_encoding($input, 'UTF-8') always correctly detects UTF-8.

mbiebl avatar Jun 27 '22 11:06 mbiebl

@phil-davis thanks for merging #1404 and preparing the release and @come-nc for investigating and fixing the issue(s)!

mbiebl avatar Jun 27 '22 11:06 mbiebl

seems to be fixed -> close

DeepDiver1975 avatar Nov 09 '23 17:11 DeepDiver1975