dav
dav copied to clipboard
Mangles vcards with diacritics
I have a vCard like the following:
BEGIN:VCARD
VERSION:3.0
UID:F0-F3-D8-A4-EC-0A-34-F2-B2-AE-53-35-FB-F0-57-C5
N:Hatschek;ž;;;
FN:ž Hatschek
END:VCARD
The problem is in https://github.com/sabre-io/dav/blob/master/lib/DAV/StringUtil.php#L78
For some reason mb_detect_encoding() thinks the file is in ISO-8859-1 format and the re-encodes it as UTF-8. This results in ž being turned into ž
should be fixed with - https://github.com/sabre-io/dav/pull/1404
@mbiebl can you test?
#1404 seems to fix this specific issue. I don't get a mangled/re-encoded vcard anymore.
The result is weird though:
mb_check_encoding($input, 'UTF-8')
on the above file is true.
mb_check_encoding($input, 'ISO-8859-1')
on the above file reports true as well.
I.e. the above data is not ISO-8859-1, or is it?
@DeepDiver1975 I'm a bit sceptical about the results of mb_*_encoding and the automatic re-encoding.
How do you feel about making this re-enconding optional?
I made a comment here about the ambiguities of UFT-8
and ISO-8859-1
https://github.com/sabre-io/http/pull/182#issuecomment-1165307289
I merged PR #1404 - IMO this provides the "best"/reasonable way to handle strings for which we do not know the encoding. If it parses as a valid UTF-8 string, then treat it is UFT-8, otherwise if it parses as ISO-8859-1 then convert it to UTF-8.
Yeah, #1404 certainly improves the situation. That said, if you are in a situation where you can control the input data and know it's valid UTF-8, avoiding the detection-and-automatic-re-encoding would prove to be useful, I think.
That said, if you are in a situation where you can control the input data and know it's valid UTF-8, avoiding the detection-and-automatic-re-encoding would prove to be useful, I think.
The new code checks
!mb_check_encoding($input, 'UTF-8') && mb_check_encoding($input, 'ISO-8859-1')
If mb_check_encoding
agrees that it looks like valid UTF*, then the logic evaluation if the if
statement will stop there. It goes to the else
and returns the input string without encoding/decoding it.
Well, that assumes that mb_check_encoding($input, 'UTF-8')
always correctly detects UTF-8.
@phil-davis thanks for merging #1404 and preparing the release and @come-nc for investigating and fixing the issue(s)!
seems to be fixed -> close