wikiclean
Dumps now contain a `bytes` attribute, which breaks parsing.
For example:
<text bytes="28961" xml:space="preserve">
This makes the cleaner return no content.
I think there was a `bytes` before... at least according to the source code. The problem is that there's a regex that expects the `xml:space`... before the `bytes` attribute.
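To illustrate the ordering problem, here is a minimal Python sketch of a pattern that does not care which attributes the `<text>` tag carries or in what order. It is not wikiclean's actual regex; `TEXT_RE` and `extract_text` are hypothetical names, and the sketch assumes one page's XML is passed in as a string.

```python
import re

# Accept any attributes, in any order, on the opening <text> tag.
# DOTALL lets the body span multiple lines.
TEXT_RE = re.compile(r"<text\b[^>]*>(.*?)</text>", re.DOTALL)


def extract_text(page_xml):
    """Return the body of the first <text> element, or None if there is none."""
    match = TEXT_RE.search(page_xml)
    return match.group(1) if match else None
```

Because the pattern only requires the tag name, it should keep working if the dump format adds or reorders attributes again.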
This just hit me, and it is a complete showstopper for wikiclean. I've added preprocessing hackery for now. :D
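For anyone needing a similar stopgap, one possible preprocessing step is to drop the `bytes` attribute from the opening `<text>` tag before handing the page to the cleaner. This is only a guess at what such a workaround could look like, written in Python; `strip_bytes_attribute` is a hypothetical helper, not part of wikiclean, and it assumes the attribute value is always a plain integer in double quotes.

```python
import re

# Capture everything in the opening <text> tag except the bytes="N" attribute.
BYTES_ATTR_RE = re.compile(r'(<text\b[^>]*?)\s+bytes="\d+"([^>]*>)')


def strip_bytes_attribute(page_xml):
    """Remove bytes="N" from <text ...> tags, leaving other attributes intact."""
    return BYTES_ATTR_RE.sub(r"\1\2", page_xml)
```

Input without a `bytes` attribute passes through unchanged, so the step can be applied to old and new dumps alike.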
Unfortunately, I'm tied up with EMNLP deadlines until next week... but PR welcome?!