
Dumps now contain a `bytes` attribute, which breaks parsing.

Open d2a-raudenaerde opened this issue 4 years ago • 2 comments

For example: `<text bytes="28961" xml:space="preserve">`. This makes the cleaner return no content.

d2a-raudenaerde, May 21 '20 08:05

I think there was a `bytes` attribute before... at least according to the source code. The problem is that there's a regex that expects `xml:space` to come before the `bytes` attribute.
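To illustrate the kind of order dependence described here, a minimal sketch in Python (wikiclean itself is Java, and both patterns below are hypothetical, not copied from its source). `STRICT` assumes `xml:space` is the first attribute on `<text>`; `TOLERANT` accepts attributes in any order. Self-closing `<text ... />` elements are not handled.

```python
import re

# Hypothetical strict pattern: matches only when xml:space is the first attribute.
STRICT = re.compile(r'<text xml:space="preserve"[^>]*>(.*?)</text>', re.DOTALL)

# Order-tolerant pattern: accepts any attributes, in any order, on <text>.
TOLERANT = re.compile(r'<text\b[^>]*>(.*?)</text>', re.DOTALL)


def extract_text(page_xml, pattern):
    """Return the body of the first <text> element, or None if nothing matches."""
    match = pattern.search(page_xml)
    return match.group(1) if match else None


old_style = '<text xml:space="preserve">Hello</text>'
new_style = '<text bytes="28961" xml:space="preserve">Hello</text>'

print(extract_text(old_style, STRICT))    # matches the old layout
print(extract_text(new_style, STRICT))    # fails on the new layout
print(extract_text(new_style, TOLERANT))  # matches regardless of attribute order
```

With the strict pattern, the new-style tag yields no match at all, which is consistent with the cleaner returning no content.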

This just hit me, and it is a complete showstopper for wikiclean. I've added preprocessing hackery for now. :D
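The actual workaround is not shown in this thread. One way such preprocessing could look, as a sketch only: strip the `bytes` attribute from every `<text>` opening tag before handing the page XML to the cleaner. The function name `strip_bytes_attr` is hypothetical, and the sketch assumes the attribute always has the form `bytes="<digits>"`.

```python
import re

# Matches a bytes="N" attribute, with its leading whitespace, inside a <text ...>
# opening tag. [^>]*? cannot cross '>', so only the opening tag is affected.
BYTES_ATTR = re.compile(r'(<text\b[^>]*?)\s+bytes="\d+"')


def strip_bytes_attr(xml):
    """Remove the bytes attribute from <text> opening tags; leave the rest as-is."""
    return BYTES_ATTR.sub(r'\1', xml)


page = '<text bytes="28961" xml:space="preserve">Some wikitext</text>'
print(strip_bytes_attr(page))
```

This restores the older tag layout so that an unmodified cleaner can parse the page; input without a `bytes` attribute passes through unchanged.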

tballison, May 28 '20 17:05

Unfortunately, I'm tied up with EMNLP deadlines until next week... but PR welcome?!

lintool, May 28 '20 18:05