wikokit
IndexOutOfBoundsException m.group(2)
What steps will reproduce the problem?
1. Use the newest Wiktionary database loaded with mwdumper
2. Try to parse it with the wikt parser
3. Catch the exception
What is the expected output? What do you see instead?
java -cp ./dist/wikt_parser.jar;./dist/lib/mysql-connector-java-5.1.13-bin.jar;./dist/lib/common_wiki.jar -Xms1212m -Xmx1212m -Xmn16m -XX:+DisableExplicitGC wikt.parser.Main en 0 1>enwikt20100824_parsed_06.log
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 2
at java.util.regex.Matcher.group(Matcher.java:487)
at wikokit.base.wikt.multi.en.WRedirectEn.getRedirect(WRedirectEn.java:42)
at wikokit.base.wikt.word.WRedirect.getRedirect(WRedirect.java:39)
at wikokit.base.wikt.word.WordBase.<init>(WordBase.java:63)
at wikt.parser.WiktParser.parseWiktionaryEntry(WiktParser.java:193)
at wikt.parser.PageTableAll.parseAllPages(PageTableAll.java:183)
at wikt.parser.Main.main(Main.java:101)
What version of the product are you using? On what operating system?
Latest svn trunk.
Windows 8 64-bit, Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 8 Apr 2013 at 6:29
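I'm not a coder, but I tried adding:
Matcher m = ptrn_redirect.matcher(text);
if (m.find()) {
    try {
        return m.group(2);
    } catch (IndexOutOfBoundsException e) {
        // No group 2 in this pattern: fall back to null.
        // (A "finally { return null; }" here would override the value
        // returned from the try block, so the method would always return null.)
        return null;
    }
}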
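For reference, `Matcher.group(n)` throws `IndexOutOfBoundsException` when the pattern defines fewer than n capturing groups, which is what "No group 2" reports. Below is a minimal sketch of guarding the call with `groupCount()` instead of catching the exception. The class name, the `safeGroup` helper, and both patterns are hypothetical, for illustration only; they are not taken from wikokit's `WRedirectEn`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedirectGroupSketch {

    /** Returns capturing group `index` of the first match, or null if unavailable. */
    public static String safeGroup(Pattern pattern, String text, int index) {
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return null; // the pattern does not match the text at all
        }
        // groupCount() is the number of capturing groups the pattern defines;
        // group(n) throws IndexOutOfBoundsException when n exceeds it.
        if (index > m.groupCount()) {
            return null;
        }
        return m.group(index);
    }

    public static void main(String[] args) {
        // Hypothetical patterns: one with a single group, one with two.
        Pattern oneGroup = Pattern.compile("#REDIRECT \\[\\[(.+?)\\]\\]");
        Pattern twoGroups = Pattern.compile("#(REDIRECT) \\[\\[(.+?)\\]\\]");
        String text = "#REDIRECT [[target]]";

        System.out.println(safeGroup(oneGroup, text, 2));  // prints "null"
        System.out.println(safeGroup(twoGroups, text, 2)); // prints "target"
    }
}
```

A guard like this only hides the symptom. If the parser really expects a second group, the pattern itself probably needs a look.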
Original comment by [email protected]
on 8 Apr 2013 at 6:31
Could you send the last several lines of the log file
`enwikt20100824_parsed_06.log`?
It would be interesting to look at the problem entry in English Wiktionary.
Original comment by [email protected]
on 8 Apr 2013 at 8:11
You can use the information in this log file to continue parsing from where it stopped.
E.g. if the last line in the log file is:
2394000: z?na, duration: 575 min, remain: 60 min
then you should run the parser with the command:
java -cp ./dist/wikt_parser.jar;./dist/lib/mysql-connector-java-5.1.13-bin.jar;./dist/lib/common_wiki.jar -Xms1212m -Xmx1212m -Xmn16m -XX:+DisableExplicitGC wikt.parser.Main en 2394000 > enwikt20100824_parsed_07.log
That is, pass 2394000 instead of 0 as the starting page.
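Picking the resume offset out of the log by hand can also be sketched in code. The example below is only an illustration: it assumes every progress line has the same shape as the one quoted above, and the class and method names are hypothetical, not part of wikokit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResumeOffsetSketch {

    // Assumed progress-line shape: "<page number>: <title>, duration: N min, remain: M min"
    private static final Pattern PROGRESS =
            Pattern.compile("^(\\d+): .*, duration: \\d+ min, remain: \\d+ min$");

    /** Returns the page number from the last progress line, or 0 if none is found. */
    public static long lastOffset(List<String> logLines) {
        // Scan backwards so trailing non-progress lines (e.g. a stack trace) are skipped.
        for (int i = logLines.size() - 1; i >= 0; i--) {
            Matcher m = PROGRESS.matcher(logLines.get(i).trim());
            if (m.matches()) {
                return Long.parseLong(m.group(1));
            }
        }
        return 0; // no progress recorded: parsing would start from the beginning
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "13000: Anne_Marie, duration: 156 min, remain: 39244 min",
                "2394000: z?na, duration: 575 min, remain: 60 min",
                "Exception in thread \"main\" java.lang.IndexOutOfBoundsException: No group 2");
        System.out.println(lastOffset(log)); // prints 2394000
    }
}
```

The returned number would take the place of `0` in the `wikt.parser.Main en 0` command line.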
Original comment by [email protected]
on 8 Apr 2013 at 8:17
Oh, sorry, I already restarted the app, so the log was overwritten.
Thanks for the tip on how to resume parsing.
Original comment by [email protected]
on 8 Apr 2013 at 8:34
If you restarted the app, then do not forget to recreate the empty Wiktionary parsed
database.
Good luck :)
Original comment by [email protected]
on 8 Apr 2013 at 8:36
Oh? So I need to truncate all tables before restarting...
And something else is bothering me:
>>13000: Anne_Marie, duration: 156 min, remain: 39244 min
Original comment by [email protected]
on 8 Apr 2013 at 8:40
Yes,
1) If you *restart* parsing from zero,
you need to delete, create, and source the enwikt_parsed database (in MySQL).
See http://code.google.com/p/wikokit/wiki/File_wikt_parsed_empty_sql
2) If you *resume* parsing from page NNNN, then you do not need to delete anything.
Original comment by [email protected]
on 8 Apr 2013 at 9:01