whipper
whipper copied to clipboard
CD-Text cannot be parsed, raising ValueErrors
I have several CDs that have possibly invalid CD-Text.
Whipper will raise this error when parsing it:
u"(ValueError('Trailing \\ in string',)
I added some debug to the toc parsing, and found that this is the value that it cannot parse properly:
Read: key: u'TITLE', value: u'\267|\246\250\245\'
(The TOC contains every slash once, not twice, but this is the representation of the string printed)
The simple fix would be to strip trailing slashes. I will do that now and test it.
This is code in question (image/toc.py):
value = value.decode('string-escape').decode('latin-1')
This patch fixes it for me:
diff --git a/morituri/image/toc.py b/morituri/image/toc.py
index c83e940..7a5dab4 100644
--- a/morituri/image/toc.py
+++ b/morituri/image/toc.py
@@ -195,6 +195,15 @@ class TocFile(object, log.Loggable):
if m:
key = m.group('key')
value = m.group('value')
+
+ # Occasionally, CDRDAO will contain CD-TEXT that ends with a
+ # slash. This will case value.decode('string-escape') to fail
+ # with a ValueError. We could -catch- that exception, but it
+ # might be more clean to just strip the trailing slash, as that
+ # seems to be the main/only issue right now.
+ while value.endswith('\\'):
+ value = value[:-1]
+
# usually, value is encoded with octal escapes and in latin-1
# FIXME: other encodings are possible, does cdrdao handle
# them ?
Maybe we can just merge this? Seems useful and also harmless.
Maybe the double \
is there because of something related to this??
The 12 payload bytes contain pieces of zero terminated data. When double-byte text is used the zero is a double byte, otherwise it is a single ASCII NUL.
https://www.gnu.org/software/libcdio/cd-text-format.html#Pack-Contents
@MerlijnWajer
Found the explanation:
- https://sourceforge.net/p/cdrdao/mailman/message/27539453/
- https://sourceforge.net/p/cdrdao/mailman/message/28713083/
CD-Text can also use the Shift JIS double byte encoding which, I think, cdrdao supports (and we're not handling in whipper). It's not allowed to mix Shift JIS and IEC 8859-1 (latin-1).
The character code is defined as follows:
$00 = ISO/IEC 8859-1 (modified, see CD EXTRA Specification, appendix 1)
$01 = ISO/IEC 646, ASCII (7 bit)
$02 .. $7F = Reserved
$80 = Music Shift-JIS Kanji
$81 = Korean character code (to be defined)
$82 = Mandarin Chinese character code (to be defined)
$83 .. $FF = Reserved
The character code indicates the character set used to code the character strings of the
PACKs with ID1 = $80 through $85. Other PACKs shall have character code $00 (ISO 8859-1
modified).
What's the purpose of the string-escape
decoding step?
To fix this issue, I think we should just escape the trailing backslash character (if present).