whipper icon indicating copy to clipboard operation
whipper copied to clipboard

CD-Text cannot be parsed, raising ValueErrors

Open MerlijnWajer opened this issue 7 years ago • 7 comments

I have several CDs that have possibly invalid CD-Text.

Whipper will raise this error when parsing it:

u"(ValueError('Trailing \\ in string',)

I added some debug to the toc parsing, and found that this is the value that it cannot parse properly:

Read: key: u'TITLE', value: u'\267|\246\250\245\'

(The TOC contains every slash once, not twice, but this is the representation of the string printed)

The simple fix would be to strip trailing slashes. I will do that now and test it.

MerlijnWajer avatar Jun 10 '17 06:06 MerlijnWajer

This is code in question (image/toc.py):

value = value.decode('string-escape').decode('latin-1')

MerlijnWajer avatar Jun 10 '17 06:06 MerlijnWajer

This patch fixes it for me:

diff --git a/morituri/image/toc.py b/morituri/image/toc.py
index c83e940..7a5dab4 100644
--- a/morituri/image/toc.py
+++ b/morituri/image/toc.py
@@ -195,6 +195,15 @@ class TocFile(object, log.Loggable):
             if m:
                 key = m.group('key')
                 value = m.group('value')
+
+                # Occasionally, CDRDAO will contain CD-TEXT that ends with a
+                # slash. This will case value.decode('string-escape') to fail
+                # with a ValueError. We could -catch- that exception, but it
+                # might be more clean to just strip the trailing slash, as that
+                # seems to be the main/only issue right now.
+                while value.endswith('\\'):
+                    value = value[:-1]
+
                 # usually, value is encoded with octal escapes and in latin-1
                 # FIXME: other encodings are possible, does cdrdao handle
                 # them ?

MerlijnWajer avatar Jun 10 '17 07:06 MerlijnWajer

Maybe we can just merge this? Seems useful and also harmless.

MerlijnWajer avatar Jan 07 '18 23:01 MerlijnWajer

Maybe the double \ is there because of something related to this??

The 12 payload bytes contain pieces of zero terminated data. When double-byte text is used the zero is a double byte, otherwise it is a single ASCII NUL.

https://www.gnu.org/software/libcdio/cd-text-format.html#Pack-Contents

JoeLametta avatar Jan 18 '18 14:01 JoeLametta

@MerlijnWajer

Found the explanation:

  • https://sourceforge.net/p/cdrdao/mailman/message/27539453/
  • https://sourceforge.net/p/cdrdao/mailman/message/28713083/

CD-Text can also use the Shift JIS double byte encoding which, I think, cdrdao supports (and we're not handling in whipper). It's not allowed to mix Shift JIS and IEC 8859-1 (latin-1).

The character code is defined as follows:

$00         = ISO/IEC 8859-1 (modified, see CD EXTRA Specification, appendix 1)
$01         = ISO/IEC 646, ASCII (7 bit)
$02 .. $7F  = Reserved
$80         = Music Shift-JIS Kanji
$81         = Korean character code (to be defined)
$82         = Mandarin Chinese character code (to be defined)
$83 .. $FF  = Reserved

The character code indicates the character set used to code the character strings of the
PACKs with ID1 = $80 through $85. Other PACKs shall have character code $00 (ISO 8859-1
modified).

JoeLametta avatar Nov 03 '18 12:11 JoeLametta

What's the purpose of the string-escape decoding step? To fix this issue, I think we should just escape the trailing backslash character (if present).

JoeLametta avatar Nov 03 '18 13:11 JoeLametta