acoustid-server
Some acoustid-server replication files corrupted
Hi Lukáš! I've been working on an acoustid replication script and found an invalid replication dump: http://data.acoustid.org/replication/acoustid-update-4620.xml.bz2
xml.sax.parse failed on this particular replication set:
xmllint --format ./acoustid-update-4620.xml
./acoustid-update-4620.xml:2: parser error : PCDATA invalid Char value 31
">Séries</column><column name="artist">Television</column><column name="track">
^
./acoustid-update-4620.xml:2: parser error : PCDATA invalid Char value 4
column name="artist">Television</column><column name="track">xœcpJLOILQHNLKUÀ
It may be a bug in xml.etree.cElementTree (used in export_tables.py), but XML escaping should be performed properly during XML generation, as shown in this sample:
>>> r = etree.Element('test')
>>> r.text = u'bla &'
>>> etree.tostring(r, encoding="UTF-8")
"<?xml version='1.0' encoding='UTF-8'?>\n<test>bla &amp;</test>"
Simple repair solution:
tidy -xml -o ./acoustid-update-4620-fixed.xml ./acoustid-update-4620.xml
The problem is that there are some weird characters in the meta table, including the 0x04 and 0x06 ASCII control characters. Those are obviously not XML-compatible, but xml.etree accepted them and printed them directly to the output. It seems that lxml.etree would raise an exception.
acoustid=> select * from meta where id=3454685;
-[ RECORD 1 ]+---------------------------------
id | 3454685
track | \x1FxœcpJLOILQHNLKUÀ\x04 xƒ\x06C
artist | Television
album | Séries
album_artist |
track_no |
disc_no |
year | 2000
There is no reason why such characters should be there, so I guess I'll have to add some extra validation and I'll also switch to lxml. It seems more reliable.
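For illustration, the extra validation could look roughly like the sketch below. It is not the actual export_tables.py code: strip_invalid_xml_chars and column_element are hypothetical names, and it assumes that silently dropping characters outside the XML 1.0 Char range is acceptable (rejecting the row instead would be an equally valid choice).

```python
import re
import xml.etree.ElementTree as etree

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(value):
    """Return value with characters not allowed in XML 1.0 removed."""
    if value is None:
        return None
    return _INVALID_XML_CHARS.sub("", value)


def column_element(name, value):
    """Hypothetical helper: build one <column> element with cleaned text."""
    elem = etree.Element("column", {"name": name})
    elem.text = strip_invalid_xml_chars(value)
    return elem


# Made-up value containing the same control characters as the bad row
# (0x1F, 0x04, 0x06); not the real database content.
elem = column_element("track", "\x1fabc\x04 x\x06C")
data = etree.tostring(elem, encoding="UTF-8")

# The cleaned output can be parsed back without errors.
parsed = etree.fromstring(data)
```

Cleaning the values before they reach the serializer keeps the check independent of which XML library writes the dump, so it would still apply after a switch to lxml.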