Error running ./WiktionarySplitter.sh

Open Gitsaibot opened this issue 6 years ago • 8 comments

I always get this error when I try to run ./WiktionarySplitter.sh. What can I do to avoid it? I am using a Debian 9 system.

endPage: hoggeries, count=2800000
title with colon: Reconstruction:Proto-Germanic/bikjǭ
Exception during parse, lastPageTitle=testamentation, titleBuilder=naseinai
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 94192868; columnNumber: 8; Invalid byte 2 of 4-byte UTF-8 sequence.
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.go(WiktionarySplitter.java:105)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.main(WiktionarySplitter.java:60)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.scanName(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
	... 11 more
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead

Gitsaibot, Jan 06 '18 10:01

As far as I can tell from the messages, the XML file is not valid UTF-8. Maybe a newer or different version of Xerces would be less picky about it, but I doubt it.
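
One way to check this independently of Xerces is to run the file through a strict UTF-8 decoder and see where it first fails. The following is a minimal Java sketch, not code from DictionaryPC: the class name Utf8Check and the method firstInvalidOffset are made up for illustration, and it reports a byte offset rather than the line/column pair that Xerces prints.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of DictionaryPC.
final class Utf8Check {
    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if the stream is clean. */
    static long firstInvalidOffset(InputStream in) throws IOException {
        // REPORT makes the decoder stop at bad input instead of silently replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(1 << 16);
        CharBuffer chars = CharBuffer.allocate(1 << 16);
        long consumed = 0; // number of bytes decoded successfully so far
        boolean eof = false;
        while (!eof) {
            int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
            if (n < 0) {
                eof = true;
            } else {
                bytes.position(bytes.position() + n);
            }
            bytes.flip();
            while (true) {
                int before = bytes.position();
                CoderResult result = decoder.decode(bytes, chars, eof);
                consumed += bytes.position() - before;
                chars.clear(); // the decoded text itself is not needed
                if (result.isError()) {
                    return consumed; // the decoder stops at the start of the bad sequence
                }
                if (result.isUnderflow()) {
                    break; // need more input (or finished, at end of file)
                }
            }
            bytes.compact(); // keep a partial multi-byte sequence for the next round
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(firstInvalidOffset(in));
        }
    }
}
```

After compiling, it can be run as `java Utf8Check data/inputs/enwiktionary-pages-articles.xml`. If it prints -1, the file decodes cleanly under Java's own UTF-8 decoder, which would point away from the data as the cause.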

rdoeffinger, Jan 06 '18 12:01

Do I have to run ./WiktionarySplitter.sh if I use my own DE-EN.txt file, or can I generate it directly? I found test files in DictionaryPC that I want to try...

Gitsaibot, Jan 07 '18 09:01

You need WiktionarySplitter (and the download scripts for the Wiktionary data) only if you actually want to use the data from Wiktionary. So I guess the answer should be "no".

rdoeffinger, Jan 08 '18 10:01

I'm getting a similar issue:

$ ./WiktionarySplitter.sh 
(...)
title with colon: Reconstruction:Proto-Iranian/páyHah
title with colon: Reconstruction:Proto-Germanic/marjaną
title with colon: Reconstruction:Proto-Indo-Iranian/mazǰʰás
title with colon: Reconstruction:Sanskrit/स्यालभार्या
title with colon: Reconstruction:Old Persian/𐎲𐎡𐎺𐎼
Exception during parse, lastPageTitle=femsplained, titleBuilder=Waidbruck of file data/inputs/enwiktionary-pages-articles.xml
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:/home/redacted/dev/DictionaryPC/data/inputs/enwiktionary-pages-articles.xml; lineNumber: 184649932; columnNumber: 20; Invalid byte 2 of 4-byte UTF-8 sequence.
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:330)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.go(WiktionarySplitter.java:113)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.main(WiktionarySplitter.java:60)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
	... 11 more
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead
Error writing to file java.io.IOException: Write end dead

kenden, Jun 06 '19 16:06

I guess the best you can do is to fix the encoding at the reported location: enwiktionary-pages-articles.xml; lineNumber: 184649932; columnNumber: 20; Invalid byte 2 of 4-byte UTF-8 sequence. Ideally, also report the issue to Wiktionary, as it seems they have broken data... I haven't checked whether the XML parser can be configured to be more permissive, but I suspect that is unfortunately not possible. Running iconv from UTF-8 to UTF-8 on the XML file might also work to clean up the broken encoding.
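
The same re-encoding idea can also be expressed in Java, for anyone who would rather stay inside the JVM. This is a rough sketch under my own assumptions, not part of DictionaryPC: the class name Utf8Clean is hypothetical, and I chose to substitute U+FFFD for each malformed sequence rather than drop it, so the damaged spots stay visible. It only makes the file decodable; the original characters at those spots are lost either way.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of DictionaryPC.
final class Utf8Clean {
    /** Copies in to out, replacing every malformed UTF-8 sequence with U+FFFD. */
    static void clean(InputStream in, OutputStream out) throws IOException {
        // REPLACE substitutes the replacement character instead of failing on bad input.
        CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try (Reader reader = new BufferedReader(new InputStreamReader(in, lenient));
             Writer writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            char[] buffer = new char[1 << 16];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: original dump, args[1]: cleaned copy (both paths chosen by the caller)
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            clean(in, out);
        }
    }
}
```

Writing to a separate output file keeps the original dump around, which is useful if the broken bytes are to be reported upstream.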

rdoeffinger, Jun 16 '19 08:06

Really, really short answer: Wiktionary really ought to run XML validation on their data, in which case they would catch and fix this themselves instead of us having to deal with bad data...

rdoeffinger, Jun 16 '19 08:06

This time I got this error at random times when running the dictionary generation multiple times. That suggests a thread synchronization problem or some other race condition, i.e. not related to the Wiktionary data itself, at least in this case...
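
With several threads involved, it helps to separate the primary failure from follow-on noise. The trailing "Write end dead" lines in the logs above look like the latter: that message comes from java.io.PipedInputStream, which reports it when the thread that was writing into the pipe has ended without closing it. The sketch below reproduces only that mechanism. That the splitter passes its output through piped streams is an assumption based on the message, not something verified in its source, and the class name WriteEndDeadDemo is made up.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo, not part of DictionaryPC.
final class WriteEndDeadDemo {
    /** Returns the message of the IOException the reader gets after the writer thread has died. */
    static String readAfterWriterDied() throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);
        Thread writer = new Thread(() -> {
            try {
                out.write(new byte[] {1, 2, 3});
                // Simulated crash: the thread ends here without calling out.close().
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        writer.start();
        writer.join(); // make sure the writer is gone before reading
        try {
            // The three buffered bytes are delivered normally; the next read fails.
            while (in.read() != -1) {
                // discard
            }
            return null; // not reached here, since the pipe was never closed
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAfterWriterDied());
    }
}
```

If the same thing happens in the splitter, each "Write end dead" line would be a consumer noticing that the parsing thread has already died from the SAXParseException, so only the first exception needs explaining.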

rdoeffinger, Apr 11 '20 23:04

I think it might be fixed actually... I've run it quite a few times and not seen this anymore. If anyone is still interested, can you test as well? Otherwise I might close this ticket.

rdoeffinger, Apr 25 '20 09:04