genv6.sh: java.lang.IllegalArgumentException: newLimit < 0: (-8 < 0) when building EN.quickdic

Open Moonbase59 opened this issue 4 years ago • 13 comments

I’m trying to rebuild the dictionaries. Unfortunately, I have a Tolino, which still can’t use the v007 format, so I used "genv6.sh". It works correctly through all EN-XX dictionaries, but when it finally tries to build EN.quickdic (v006), I get an error:

Exception in thread "main" java.io.IOException: RuntimeException loading dictionary
	at com.hughes.android.dictionary.engine.Dictionary.<init>(Dictionary.java:115)
	at com.hughes.android.dictionary.engine.ConvertToV6.main(ConvertToV6.java:50)
	at com.hughes.android.dictionary.engine.Runner.main(Runner.java:31)
Caused by: java.lang.IllegalArgumentException: newLimit < 0: (-8 < 0)
	at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
	at java.base/java.nio.Buffer.limit(Buffer.java:346)
	at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
	at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
	at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
	at com.hughes.util.DataInputBuffer.slice(DataInputBuffer.java:49)
	at com.hughes.util.raf.RAFList.<init>(RAFList.java:81)
	at com.hughes.util.raf.RAFList.create(RAFList.java:164)
	at com.hughes.android.dictionary.engine.Dictionary.<init>(Dictionary.java:112)
	... 2 more

This happens:

  • with both my compiled version and using your binary Linux version "DictionaryPC"
  • on both a 8GB RAM Linux Mint 20.2 and a 32GB RAM Ubuntu Studio 14.04 machine

Looking through the v007 .quickdic outputs, EN.quickdic is by far the largest, at 155.5 MB.

Any ideas/help?

Moonbase59 avatar Dec 07 '21 12:12 Moonbase59

I can't say anything about the why, but the problem here is not the conversion to v6 but that the v7 dictionary is broken. Can't tell much more with the info available I'm afraid. I guess a really basic and slightly silly suggestion might be to try regenerating the dictionary, just in case it was a one-off glitch.

rdoeffinger avatar Dec 08 '21 22:12 rdoeffinger

Tried that a few times, but I suppose you’re right. I guess either the WiktionarySplitter.sh does something odd, or the Wiktionary data contains something bad. :-(

Moonbase59 avatar Dec 11 '21 00:12 Moonbase59

No, if it's reproducible there's a bug somewhere. The generated dictionary should never be structurally invalid, no matter the input data. Can you provide some reproduction information/data? Whether it's the Wiktionary input file(s) you used, the split data, or the generated file: anything that would allow me to get the same error?

rdoeffinger avatar Dec 11 '21 20:12 rdoeffinger

I’ll try to create a reproducible setup. How reproducible it is remains debatable, of course, since I needed to install several dependencies and change the scripts accordingly. I’m mainly using Linux Mint 20.2.

Moonbase59 avatar Dec 12 '21 11:12 Moonbase59

I think I mostly need to be sure to have the same input and the same commands. If that's not enough to reproduce it, it might be time to consider that it might be an issue with versions of dependencies or something like that. Also, as said, the dictionary file that fails to convert might be enough to get an idea of what went wrong. A quick look at the code ended with just an "I see no way this can happen", so I'm not getting any further there.

rdoeffinger avatar Dec 12 '21 12:12 rdoeffinger

Let’s see, what did I do (apart from installing missing dependencies)?

  • Ran data/downloadInputs.sh

ls -l /usr/share/java/icu* shows:

-rw-r--r-- 1 root root 12364629 Feb 25  2019 /usr/share/java/icu4j-60.2.jar
-rw-r--r-- 1 root root  3129482 Feb 25  2019 /usr/share/java/icu4j-charset-60.2.jar
lrwxrwxrwx 1 root root       22 Feb 25  2019 /usr/share/java/icu4j-charset.jar -> icu4j-charset-60.2.jar
lrwxrwxrwx 1 root root       14 Feb 25  2019 /usr/share/java/icu4j.jar -> icu4j-60.2.jar
-rw-r--r-- 1 root root    55599 Feb 25  2019 /usr/share/java/icu4j-localespi-60.2.jar
lrwxrwxrwx 1 root root       24 Feb 25  2019 /usr/share/java/icu4j-localespi.jar -> icu4j-localespi-60.2.jar

and ls -l /usr/share/java/commons* shows:

lrwxrwxrwx 1 root root     15 Jun 19  2017 /usr/share/java/commons-cli-1.4.jar -> commons-cli.jar
-rw-r--r-- 1 root root  47653 Jun 19  2017 /usr/share/java/commons-cli.jar
lrwxrwxrwx 1 root root     17 Jan 27  2020 /usr/share/java/commons-codec-1.14.jar -> commons-codec.jar
-rw-r--r-- 1 root root 346621 Jan 27  2020 /usr/share/java/commons-codec.jar
lrwxrwxrwx 1 root root     20 Jan 27  2020 /usr/share/java/commons-compress-1.19.jar -> commons-compress.jar
-rw-r--r-- 1 root root 602109 Jan 27  2020 /usr/share/java/commons-compress.jar
lrwxrwxrwx 1 root root     14 Sep 23 17:57 /usr/share/java/commons-io-2.6.jar -> commons-io.jar
-rw-r--r-- 1 root root 211957 Sep 23 17:57 /usr/share/java/commons-io.jar
lrwxrwxrwx 1 root root     17 Nov 26  2018 /usr/share/java/commons-lang3-3.8.jar -> commons-lang3.jar
-rw-r--r-- 1 root root 495121 Nov 26  2018 /usr/share/java/commons-lang3.jar
-rw-r--r-- 1 root root  58179 Jan 12  2018 /usr/share/java/commons-logging-1.2.jar
-rw-r--r-- 1 root root  22029 Jan 12  2018 /usr/share/java/commons-logging-adapters-1.2.jar
lrwxrwxrwx 1 root root     32 Jan 12  2018 /usr/share/java/commons-logging-adapters.jar -> commons-logging-adapters-1.2.jar
-rw-r--r-- 1 root root  49531 Jan 12  2018 /usr/share/java/commons-logging-api-1.2.jar
lrwxrwxrwx 1 root root     27 Jan 12  2018 /usr/share/java/commons-logging-api.jar -> commons-logging-api-1.2.jar
lrwxrwxrwx 1 root root     23 Jan 12  2018 /usr/share/java/commons-logging.jar -> commons-logging-1.2.jar
lrwxrwxrwx 1 root root     16 Nov  4  2019 /usr/share/java/commons-text-1.8.jar -> commons-text.jar
-rw-r--r-- 1 root root 200726 Nov  4  2019 /usr/share/java/commons-text.jar

For compilation (although I later used your Linux binary), I modified the beginning of compile.sh like so:

ICU4J=/usr/share/java/icu4j-49.1.jar
test -r "$ICU4J" || ICU4J=/usr/share/icu4j-55/lib/icu4j.jar
test -r "$ICU4J" || ICU4J=/usr/share/java/icu4j-60.2.jar
JUNIT=/usr/share/java/junit.jar
test -r "$JUNIT" || JUNIT=/usr/share/junit/lib/junit.jar
COMMONS=/usr/share/java/commons-text.jar
COMMONS_COMPRESS=/usr/share/java/commons-compress.jar

Also, at least my javac needs -source and -target (no double-hyphen):

javac -source 11 -target 11 --limit-modules java.xml,java.logging -Xlint:all -encoding UTF-8 -g -d bin/ ../Dictionary/Util/src/com/hughes/util/*.java ../Dictionary/Util/src/com/hughes/util/raf/*.java ../Dictionary/src/com/hughes/android/dictionary/DictionaryInfo.java ../Dictionary/src/com/hughes/android/dictionary/engine/*.java ../Dictionary/src/com/hughes/android/dictionary/C.java src/com/hughes/util/*.java src/com/hughes/android/dictionary/*.java src/com/hughes/android/dictionary/*/*.java src/com/hughes/android/dictionary/*/*/*.java -classpath "$ICU4J:$JUNIT:$COMMONS:$COMMONS_COMPRESS"
  • Ran ./compile.sh
  • Got Linux binary DictionaryPC since I assumed there might be something wrong with my compilation (and possibly ICU versions etc.)

Now using your binary …

  • Ran ./WiktionarySplitter.sh
  • Ran ./generate_dictionaries.sh, which seemingly created all v7 dictionaries. (I did not modify it to exclude any files, so it was supposed to create all dictionaries.)

Btw, this part in generate_dictionaries.sh made me bring up "reverse dictionaries":

reverse_dicts=""
if test "$lang" = "DE" -o "$lang" = "FR" -o "$lang" = "IT" ; then
reverse_dicts="--input3=data/inputs/wikiSplit/$langcode/EN.data --input3Format=WholeSectionToHtmlParser --input3Name=${langcode}wikitionary --input3WiktionaryLang=$lang --input3TitleIndex=1 --input3WebUrlTemplate=http://${langcode}.wiktionary.org/wiki/%s"
#reverse_dicts="$reverse_dicts --input4=data/inputs/wikiSplit/$langcode/EN.data --input4Name=${langcode}wikitionary --input4Format=enwiktionary --input4LangPattern=${enlangname} --input4LangCodePattern=en --input4EnIndex=1 --input4WiktionaryType=EnForeign"
fi
  • Ran ./genv6.sh, which created lots of v6 dictionaries (including all EN-xx) until it arrived at the above error when trying to create the EN.dic.

Let me know if I must do all this again, since it takes many, many hours…

Moonbase59 avatar Dec 12 '21 13:12 Moonbase59

I don't think there is a point doing it again, but downloadInputs always downloads whichever is latest, so it would be good to know the exact database dump date. But since they are regularly deleted, that might not work. Maybe you could upload your wikiSplit/en/EN.data somewhere? And as mentioned, the v7 dictionary you got might help as well.

rdoeffinger avatar Dec 12 '21 13:12 rdoeffinger

The date was 2021-12-07, and here are the files:

Moonbase59 avatar Dec 12 '21 13:12 Moonbase59

Btw you know that you can edit just the first few lines of the generate_dictionaries.sh file to only generate this failing dictionary? Would look like this:

DE_DICTS=true
DE_DICTS=false
EN_DICTS=true
EN_DICTS=false
FR_DICTS=true
FR_DICTS=false
IT_DICTS=true
IT_DICTS=false
EN_TRANS_DICTS=true
EN_TRANS_DICTS=false
SINGLE_DICTS="en de fr it es pt"
SINGLE_DICTS="en"

The problem here is indeed that the EN.quickdic file is incomplete: a valid file would always end with the text "END OF DICTIONARY", but in this case the whole index is actually missing. I've not gotten around to checking why that might have happened.
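
A quick way to spot such truncated files before running genv6.sh might look like the sketch below. It is only an illustration: the function names `has_end_marker` and `check_dir` are made up here, the 64-byte window is an arbitrary choice, and it assumes the marker is stored as plain ASCII bytes within the last few bytes of the file.

```shell
#!/bin/sh
# Hypothetical helper (not part of the DictionaryPC scripts):
# succeeds if the last 64 bytes of the file contain the end marker.
# Assumes the marker is written as plain ASCII near the very end.
has_end_marker() {
    tail -c 64 "$1" | grep -a -q 'END OF DICTIONARY'
}

# Hypothetical helper: list every .quickdic file in a directory that
# lacks the marker; returns non-zero if at least one is incomplete.
check_dir() {
    status=0
    for f in "$1"/*.quickdic; do
        [ -e "$f" ] || continue   # directory without any .quickdic files
        if ! has_end_marker "$f"; then
            echo "incomplete: $f"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_dir` on the directory holding the generated v7 files (wherever that is in a given setup) would then print one line per file that is missing the marker.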

rdoeffinger avatar Dec 16 '21 12:12 rdoeffinger

Using your EN.data.gz I did get a working EN.quickdic. Maybe you could re-try the generate_dictionaries.sh step (for only this one, should take very little time) and check the output for anything suspicious?

rdoeffinger avatar Dec 16 '21 13:12 rdoeffinger

Ok, I changed generate_dictionaries.sh as suggested above, and did a

./generate_dictionaries.sh &> gen-EN.log

I attach gen-EN.log. In spite of generating LOTS of warnings and errors, and data/inputs/wikiSplit/en/EN.data.gz still being the same file from 2021-12-07, a v7 dict is now generated, and roughly 20 MB larger. And it has "END OF DICTIONARY" at the end.

So there must be some difference between running the "all-enabled" and the "one file only" generate_dictionaries.sh process.

gen-EN.log.zip

EDIT: ./genv6.sh also runs through on this newly-created v7 EN.quickdic.

Moonbase59 avatar Dec 17 '21 10:12 Moonbase59

The log of a working run will not really help :) You can check the script, it runs exactly the same command if you generate one or many dictionaries. However the script does not really have much error checking. My best guess (but still just a pretty arbitrary one) would be that something temporarily used a lot of RAM or storage and caused that one run to crash. Especially with older Java versions it can use up all the 8 GB max heap when generating the dictionary.
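
To make a crashed run visible instead of silently leaving a truncated file behind, each generation command could be wrapped roughly as follows. This is a sketch under assumptions, not what generate_dictionaries.sh does: `run_checked` is a made-up name, and it relies on the failing process exiting with a non-zero status (which an uncaught Java exception or a killed process normally does).

```shell
#!/bin/sh
# Hypothetical wrapper (not part of generate_dictionaries.sh):
#   run_checked OUTPUT_FILE COMMAND [ARGS...]
# Runs COMMAND; if it fails, reports the exit status on stderr and removes
# OUTPUT_FILE so that later steps cannot pick up a truncated result.
run_checked() {
    out_file=$1
    shift
    if "$@"; then
        return 0
    else
        rc=$?   # exit status of the failed command
        echo "FAILED (exit $rc): $*" >&2
        rm -f -- "$out_file"
        return "$rc"
    fi
}
```

The first argument is the output file the command is expected to write; everything after it is the command line itself, passed through unchanged.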

rdoeffinger avatar Dec 17 '21 16:12 rdoeffinger

Yeah, I guess RAM might have been "it"—this laptop only has 8 GB RAM (4 cores, 8 threads total) and the whole process was pretty much eating resources when generating all dictionaries.

Moonbase59 avatar Dec 17 '21 17:12 Moonbase59