
Some acoustid-server replication files corrupted

Open adamansky opened this issue 11 years ago • 1 comment

Hi Lukáš! I'm working on an acoustid replication script and found an invalid replication dump: http://data.acoustid.org/replication/acoustid-update-4620.xml.bz2

xml.sax.parse failed on this particular replication set

xmllint --format ./acoustid-update-4620.xml  
./acoustid-update-4620.xml:2: parser error : PCDATA invalid Char value 31
">Séries</column><column name="artist">Television</column><column name="track">
                                                                               ^
./acoustid-update-4620.xml:2: parser error : PCDATA invalid Char value 4
column name="artist">Television</column><column name="track">xœcpJLOILQHNLKUÀ

It may be a bug in xml.etree.cElementTree (used in export_tables.py), but XML escaping should be handled correctly during XML generation, as this sample shows:

>>> r = etree.Element('test')
>>> r.text = u'bla &'
>>> etree.tostring(r, encoding="UTF-8")
"<?xml version='1.0' encoding='UTF-8'?>\n<test>bla &amp;</test>"

Simple repair solution:

tidy -xml  -o ./acoustid-update-4620-fixed.xml ./acoustid-update-4620.xml

adamansky avatar Aug 28 '12 10:08 adamansky

The problem is that there are some weird characters in the meta table, including the ASCII control characters 0x04 and 0x06. Those are not valid in XML 1.0, but xml.etree accepted them and wrote them directly to the output. It seems that lxml.etree would raise an exception instead.

acoustid=> select * from meta where id=3454685;
-[ RECORD 1 ]+---------------------------------
id           | 3454685
track        | \x1FxœcpJLOILQHNLKUÀ\x04 xƒ\x06C
artist       | Television
album        | Séries
album_artist | 
track_no     | 
disc_no      | 
year         | 2000
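
The xml.etree behaviour described above can be reproduced with a short sketch. It assumes Python 3 and the standard-library `xml.etree.ElementTree` rather than the `cElementTree` module used in export_tables.py, and the test string is made up apart from the 0x04 character reported by xmllint:

```python
import xml.etree.ElementTree as ET

r = ET.Element("test")
r.text = "bla & \x04"  # ampersand plus an ASCII control character (0x04)

# Markup characters are escaped, but the control character is written verbatim.
serialized = ET.tostring(r, encoding="unicode")

# The parser enforces the XML 1.0 character rules and rejects the result.
try:
    ET.fromstring(serialized)
    round_trips = True
except ET.ParseError:
    round_trips = False

print(repr(serialized), round_trips)
```

So the serializer escapes `&` as expected, yet it can still emit a document that its own parser refuses to read back.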

There is no reason why such characters should be there, so I guess I'll have to add some extra validation and I'll also switch to lxml. It seems more reliable.
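
A minimal sketch of what such validation might look like, assuming Python 3. It is not taken from acoustid-server: the helper name `strip_invalid_xml_chars` is hypothetical, and silently dropping the characters (instead of rejecting the row) is only one possible policy. The character ranges come from the `Char` production of XML 1.0:

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(value):
    """Return value with the characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", value)


# The track value from the meta row shown above.
track = "\x1fxœcpJLOILQHNLKUÀ\x04 xƒ\x06C"
clean = strip_invalid_xml_chars(track)

column = ET.Element("column", name="track")
column.text = clean

# After cleaning, the serialized element parses again without a ParseError.
reparsed = ET.fromstring(ET.tostring(column, encoding="unicode"))
print(repr(clean), reparsed.text == clean)
```

Applying the same check when metadata is submitted, rather than only at export time, would keep such values out of the meta table in the first place.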

lalinsky avatar Aug 28 '12 11:08 lalinsky