UTF-16 Encoding Issue
Thanks for a great library! I appreciate your hard work on it.
Like in my last issue (which is a distinct encoding issue), I'm running into more trouble with encoding.
Note that I'm calling SetVersion prior to SetDefaultEncoding because of #85
tags.SetVersion(3) // id3v2.3.0
tags.SetDefaultEncoding(id3v2.EncodingUTF16) // UTf-8 would be cool, but 2.3 doesn't support it
tags.SetTitle("Мир! Вашему! Дому! (K-DEF Remix)")
When I run exiftool -v3 -l myfile.mp3, it indicates this for TIT2
| 009c: 01 fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04 [......8.@.!. ...]
| 00ac: 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04 [0.H.5.<.C.!. ...]
| 00bc: 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00 [>.<.C.!. .(.K.-.]
| 00cc: 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00 [D.E.F. .R.e.m.i.]
| 00dc: 78 00 29 00 00 00 [x.)...]
I see that the first 0x01 indicates that we're dealing with UCS-2 with a BOM. Let's disregard that. Then the BOM says it's big endian (fine). I edited out the 0x01 tag in a text editor then pasted it into an interactive python session:
>>> hexstr="""
... fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04
... 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04
... 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00
... 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00
... 78 00 29 00 00 00
... """
>>> raw = bytes([int(x, 16) for x in hexstr.replace("\n"," ").split(" ") if x != ""])
>>> raw.decode("utf16")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x00 in position 68: truncated data
>>> len(raw)
69
Kinda hard to get to the standard at the moment (it's been like this for a while):
Here's a copy: https://web.archive.org/web/20190207033339/https://id3.org/id3v2.3.0#ID3v2_frame_overview
At least in 2.3.0, it does say that you're supposed to have a null terminator (so, 00 00 for UTF-16/UCS-2). I don't think that you can have a UCS-2 string with an odd number of characters. So it seems like you're appending an additional null character, and the length becomes not a multiple of 2, which makes it not legit UCS-2 or UTF-16.
I'm pasting a screenshot rather than copying the text of the output of id3edit because I think the colors are cool:
So that thinks it's bad too. If I just remove the one extra null byte, python seems happy with it (and other tools):
>>> hexstr = """
... fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04
... 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04
... 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00
... 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00
... 78 00 29 00 00
... """
>>> raw = bytes([int(x, 16) for x in hexstr.replace("\n"," ").split(" ") if x != ""])
>>> raw.decode("utf16")
'Мир! Вашему! Дому! (K-DEF Remix)\x00'
I think this has something to do with other PRs that were meant to fix things for other encodings maybe in 2.4...
This looks relevant: https://github.com/n10v/id3v2/blob/34286c4b196c8cbf785f3402fd6f127fcb696714/encoding.go#L162-L164
Also, oddly, in your code,
// bom is used in UTF-16 encoded Unicode with BOM.
// See https://en.wikipedia.org/wiki/Byte_order_mark.
var bom = []byte{0xFF, 0xFE}
This is the little-endian BOM. Not big-endian. But your program seems to write big-endian BOM (and big-endian text).
Alright I think I got it:
https://github.com/n10v/id3v2/blob/34286c4b196c8cbf785f3402fd6f127fcb696714/text_frame.go#L24-L33
When I call tf.WriteTo(), it calls bw.EncodeAndWriteText. Follow along, it will add a single 0 -
func (bw *bufWriter) EncodeAndWriteText(src string, to Encoding) {
if bw.err != nil {
return
}
bw.err = encodeWriteText(bw, src, to)
}
encodeWriteText does this:
// encodeWriteText encodes src from UTF-8 to "to" encoding and writes to bw.
func encodeWriteText(bw *bufWriter, src string, to Encoding) error {
if to.Equals(EncodingUTF8) {
bw.WriteString(src)
return nil
}
toXEncoding := resolveXEncoding(nil, to)
encoded, err := toXEncoding.NewEncoder().String(src)
if err != nil {
return err
}
bw.WriteString(encoded)
// Here we go! 💣
if to.Equals(EncodingUTF16) && !bytes.HasSuffix([]byte(encoded), []byte{0}) {
bw.WriteByte(0)
}
return nil
}
So at this point, before encodeWriteText is called, it doesn't have the termination characters added. Therefore that if clause: !bytes.HasSuffix([]byte(encoded), []byte{0}) is true. But really that entire if statement is wrong: a single 0 in UTF-16 LE at the end of any ASCII string that's been encoded as UTF-16 will be present. I think this should be removed entirely.
After encodeWriteText is called, now we have a single 0 at the end, which will give us an odd number of bytes (invalid UCS-2/UTF-16). Then back to TextFrame.WriteTo():
func (tf TextFrame) WriteTo(w io.Writer) (int64, error) {
return useBufWriter(w, func(bw *bufWriter) {
bw.WriteByte(tf.Encoding.Key)
bw.EncodeAndWriteText(tf.Text, tf.Encoding) // <- this added a single 0
// https://github.com/bogem/id3v2/pull/52
// https://github.com/bogem/id3v2/pull/33
bw.Write(tf.Encoding.TerminationBytes) // <- now we have two more
})
}
There you go, three null bytes at the end. So there's a bug with UTF-16, and that's consistent with the tools I'm using to read it.