genv6.sh: java.lang.IllegalArgumentException: newLimit < 0: (-8 < 0) when building EN.quickdic
I’m trying to rebuild the dictionaries. Unfortunately I have a Tolino, which still can’t use the v007 format, so I used "genv6.sh". It works correctly through all EN-XX dictionaries, but when it finally tries to build EN.quickdic (v006), I get an error:
Exception in thread "main" java.io.IOException: RuntimeException loading dictionary
at com.hughes.android.dictionary.engine.Dictionary.<init>(Dictionary.java:115)
at com.hughes.android.dictionary.engine.ConvertToV6.main(ConvertToV6.java:50)
at com.hughes.android.dictionary.engine.Runner.main(Runner.java:31)
Caused by: java.lang.IllegalArgumentException: newLimit < 0: (-8 < 0)
at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
at java.base/java.nio.Buffer.limit(Buffer.java:346)
at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
at com.hughes.util.DataInputBuffer.slice(DataInputBuffer.java:49)
at com.hughes.util.raf.RAFList.<init>(RAFList.java:81)
at com.hughes.util.raf.RAFList.create(RAFList.java:164)
at com.hughes.android.dictionary.engine.Dictionary.<init>(Dictionary.java:112)
... 2 more
This happens:
- with both my own compiled version and with your binary Linux version "DictionaryPC"
- on both an 8 GB RAM Linux Mint 20.2 machine and a 32 GB RAM Ubuntu Studio 14.04 machine
Looking through the v007 .quickdic outputs, EN.quickdic is by far the largest at 155.5 MB.
Any ideas/help?
I can't say anything about the why, but the problem here is not the conversion to v6 but that the v7 dictionary is broken. Can't tell much more with the info available I'm afraid. I guess a really basic and slightly silly suggestion might be to try regenerating the dictionary, just in case it was a one-off glitch.
Tried that a few times, but I suppose you’re right. I guess either the WiktionarySplitter.sh does something odd, or the Wiktionary data contains something bad. :-(
No, if it's reproducible there's a bug somewhere. The generated dictionary should never be structurally invalid no matter the input data. Can you provide some reproduction information/data? Whether it's which wiktionary input file(s) you used, to the split data or the generated file, something that would allow me to get the same error?
I’ll try to create a reproducible setup, though how reproducible it will be is of course debatable, since I needed to install several dependencies and change the scripts accordingly. I’m mainly using Linux Mint 20.2.
I think I mostly need to be sure to have the same input and the same commands. If that's not enough to reproduce it might be time to consider it might be an issue with versions of dependencies or something like that. Also as said the dictionary file that fails to convert might be enough to get an idea of what went wrong. A quick look of the code ended with just a "I see no way this can happen", so not getting any further there.
Let’s see, what did I do (apart from installing missing dependencies)?
- Ran data/downloadInputs.sh
ls -l /usr/share/java/icu* shows:
-rw-r--r-- 1 root root 12364629 Feb 25 2019 /usr/share/java/icu4j-60.2.jar
-rw-r--r-- 1 root root 3129482 Feb 25 2019 /usr/share/java/icu4j-charset-60.2.jar
lrwxrwxrwx 1 root root 22 Feb 25 2019 /usr/share/java/icu4j-charset.jar -> icu4j-charset-60.2.jar
lrwxrwxrwx 1 root root 14 Feb 25 2019 /usr/share/java/icu4j.jar -> icu4j-60.2.jar
-rw-r--r-- 1 root root 55599 Feb 25 2019 /usr/share/java/icu4j-localespi-60.2.jar
lrwxrwxrwx 1 root root 24 Feb 25 2019 /usr/share/java/icu4j-localespi.jar -> icu4j-localespi-60.2.jar
and ls -l /usr/share/java/commons* shows:
lrwxrwxrwx 1 root root 15 Jun 19 2017 /usr/share/java/commons-cli-1.4.jar -> commons-cli.jar
-rw-r--r-- 1 root root 47653 Jun 19 2017 /usr/share/java/commons-cli.jar
lrwxrwxrwx 1 root root 17 Jan 27 2020 /usr/share/java/commons-codec-1.14.jar -> commons-codec.jar
-rw-r--r-- 1 root root 346621 Jan 27 2020 /usr/share/java/commons-codec.jar
lrwxrwxrwx 1 root root 20 Jan 27 2020 /usr/share/java/commons-compress-1.19.jar -> commons-compress.jar
-rw-r--r-- 1 root root 602109 Jan 27 2020 /usr/share/java/commons-compress.jar
lrwxrwxrwx 1 root root 14 Sep 23 17:57 /usr/share/java/commons-io-2.6.jar -> commons-io.jar
-rw-r--r-- 1 root root 211957 Sep 23 17:57 /usr/share/java/commons-io.jar
lrwxrwxrwx 1 root root 17 Nov 26 2018 /usr/share/java/commons-lang3-3.8.jar -> commons-lang3.jar
-rw-r--r-- 1 root root 495121 Nov 26 2018 /usr/share/java/commons-lang3.jar
-rw-r--r-- 1 root root 58179 Jan 12 2018 /usr/share/java/commons-logging-1.2.jar
-rw-r--r-- 1 root root 22029 Jan 12 2018 /usr/share/java/commons-logging-adapters-1.2.jar
lrwxrwxrwx 1 root root 32 Jan 12 2018 /usr/share/java/commons-logging-adapters.jar -> commons-logging-adapters-1.2.jar
-rw-r--r-- 1 root root 49531 Jan 12 2018 /usr/share/java/commons-logging-api-1.2.jar
lrwxrwxrwx 1 root root 27 Jan 12 2018 /usr/share/java/commons-logging-api.jar -> commons-logging-api-1.2.jar
lrwxrwxrwx 1 root root 23 Jan 12 2018 /usr/share/java/commons-logging.jar -> commons-logging-1.2.jar
lrwxrwxrwx 1 root root 16 Nov 4 2019 /usr/share/java/commons-text-1.8.jar -> commons-text.jar
-rw-r--r-- 1 root root 200726 Nov 4 2019 /usr/share/java/commons-text.jar
For compilation (although I later used your Linux binary), I modified the beginning of compile.sh like so:
ICU4J=/usr/share/java/icu4j-49.1.jar
test -r "$ICU4J" || ICU4J=/usr/share/icu4j-55/lib/icu4j.jar
test -r "$ICU4J" || ICU4J=/usr/share/java/icu4j-60.2.jar
JUNIT=/usr/share/java/junit.jar
test -r "$JUNIT" || JUNIT=/usr/share/junit/lib/junit.jar
COMMONS=/usr/share/java/commons-text.jar
COMMONS_COMPRESS=/usr/share/java/commons-compress.jar
Also, at least my javac needs -source and -target (no double-hyphen):
javac -source 11 -target 11 --limit-modules java.xml,java.logging -Xlint:all -encoding UTF-8 -g -d bin/ ../Dictionary/Util/src/com/hughes/util/*.java ../Dictionary/Util/src/com/hughes/util/raf/*.java ../Dictionary/src/com/hughes/android/dictionary/DictionaryInfo.java ../Dictionary/src/com/hughes/android/dictionary/engine/*.java ../Dictionary/src/com/hughes/android/dictionary/C.java src/com/hughes/util/*.java src/com/hughes/android/dictionary/*.java src/com/hughes/android/dictionary/*/*.java src/com/hughes/android/dictionary/*/*/*.java -classpath "$ICU4J:$JUNIT:$COMMONS:$COMMONS_COMPRESS"
- Ran ./compile.sh
- Got the Linux binary DictionaryPC, since I assumed there might be something wrong with my compilation (and possibly ICU versions etc.); now using your binary …
- Ran ./WiktionarySplitter.sh
- Ran ./generate_dictionaries.sh, which seemingly created all v7 dictionaries. (I did not modify it to exclude any files, so it was supposed to create all dictionaries.)
Btw, this part in generate_dictionaries.sh made me bring up "reverse dictionaries":
reverse_dicts=""
if test "$lang" = "DE" -o "$lang" = "FR" -o "$lang" = "IT" ; then
reverse_dicts="--input3=data/inputs/wikiSplit/$langcode/EN.data --input3Format=WholeSectionToHtmlParser --input3Name=${langcode}wikitionary --input3WiktionaryLang=$lang --input3TitleIndex=1 --input3WebUrlTemplate=http://${langcode}.wiktionary.org/wiki/%s"
#reverse_dicts="$reverse_dicts --input4=data/inputs/wikiSplit/$langcode/EN.data --input4Name=${langcode}wikitionary --input4Format=enwiktionary --input4LangPattern=${enlangname} --input4LangCodePattern=en --input4EnIndex=1 --input4WiktionaryType=EnForeign"
fi
- Ran ./genv6.sh, which created lots of v6 dictionaries (including all EN-xx) until it hit the above error when trying to create the EN.quickdic.
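The sequence of steps above can be recapped as one small script. This is only a sketch: the script names are taken from this thread, and a guard skips any that are missing so the sketch can be run outside a DictionaryPC checkout.

```shell
#!/bin/sh
# Recap of the build pipeline from the steps above; each script name is
# taken from this thread. Missing scripts are skipped rather than failing,
# so the sketch is safe to run anywhere.
for step in data/downloadInputs.sh ./compile.sh ./WiktionarySplitter.sh \
            ./generate_dictionaries.sh ./genv6.sh; do
    if [ -x "$step" ]; then
        echo "running $step"
        "$step"
    else
        echo "skip: $step not found"
    fi
done
```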
Let me know if I must do all this again, since it takes many, many hours…
I don't think there is a point doing it again, but downloadInputs always downloads whichever is latest, so it would be good to know the exact database dump date. But since they are regularly deleted, that might not work. Maybe you could upload your wikiSplit/en/EN.data somewhere? And as mentioned, the v7 dictionary you got might help as well.
Btw you know that you can edit just the first few lines of the generate_dictionaries.sh file to only generate this failing dictionary? Would look like this:
DE_DICTS=true
DE_DICTS=false
EN_DICTS=true
EN_DICTS=false
FR_DICTS=true
FR_DICTS=false
IT_DICTS=true
IT_DICTS=false
EN_TRANS_DICTS=true
EN_TRANS_DICTS=false
SINGLE_DICTS="en de fr it es pt"
SINGLE_DICTS="en"
The problem here is indeed that the EN.quickdic file is incomplete: a valid file always ends with the text "END OF DICTIONARY", but in this case the whole index is actually missing. I've not yet gotten around to checking why that might have happened.
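The completeness check described above is easy to script. A minimal sketch; the file name is illustrative (a stand-in file is created here so the snippet is self-contained; point it at a real .quickdic instead):

```shell
# A structurally complete v7 .quickdic ends with the literal marker
# "END OF DICTIONARY". demo.quickdic is a stand-in created for this
# sketch; replace it with the real file you want to check.
printf 'binary payload ... END OF DICTIONARY' > demo.quickdic

if tail -c 32 demo.quickdic | grep -q 'END OF DICTIONARY'; then
    echo "demo.quickdic looks complete"
else
    echo "demo.quickdic is truncated"
fi
```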
Using your EN.data.gz I did get a working EN.quickdic. Maybe you could retry the generate_dictionaries.sh step (for only this one dictionary, which should take very little time) and check the output for anything suspicious?
Ok, I changed generate_dictionaries.sh as suggested above, and did a
./generate_dictionaries.sh &> gen-EN.log
I attach gen-EN.log. Despite generating LOTS of warnings and errors, and despite data/inputs/wikiSplit/en/EN.data.gz still being the same file from 2021-12-07, a v7 dict is now generated, roughly 20 MB larger. And it has "END OF DICTIONARY" at the end.
So there must be some difference between running the "all-enabled" and the "one file only" generate_dictionaries.sh process.
EDIT: ./genv6.sh also runs through on this newly-created v7 EN.quickdic.
The log of a working run will not really help :) You can check the script; it runs exactly the same command whether you generate one dictionary or many. However, the script does not have much error checking. My best guess (still a pretty arbitrary one) would be that something temporarily used a lot of RAM or storage and caused that one run to crash. Especially with older Java versions, generating the dictionary can use up the whole 8 GB max heap.
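One way to test the RAM hypothesis on a future run is to log free memory alongside the generation. A sketch (Linux only, since it relies on `free`); here `sleep 2` stands in for the real, hours-long `./generate_dictionaries.sh` run so the snippet is self-contained:

```shell
# Sample free memory once per second into mem.log while a long job runs.
# On a real run, a steadily shrinking "available" figure in mem.log would
# support the out-of-memory theory.
( while sleep 1; do free -m | grep '^Mem:'; done ) > mem.log &
mon=$!
sleep 2        # stand-in for the real ./generate_dictionaries.sh run
kill "$mon"
cat mem.log
```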
Yeah, I guess RAM might have been "it"—this laptop only has 8 GB RAM (4 cores, 8 threads total) and the whole process was pretty much eating resources when generating all dictionaries.