
IndexOutOfBoundsException m.group(2)

Open GoogleCodeExporter opened this issue 9 years ago • 8 comments

What steps will reproduce the problem?
1. Use the newest Wiktionary database loaded with mwdumper
2. Try to parse it with the wikt parser
3. Catch the exception

What is the expected output? What do you see instead?

java -cp ./dist/wikt_parser.jar;./dist/lib/mysql-connector-java-5.1.13-bin.jar;./dist/lib/common_wiki.jar -Xms1212m -Xmx1212m -Xmn16m -XX:+DisableExplicitGC wikt.parser.Main en 0  1>enwikt20100824_parsed_06.log
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 2
        at java.util.regex.Matcher.group(Matcher.java:487)
        at wikokit.base.wikt.multi.en.WRedirectEn.getRedirect(WRedirectEn.java:42)
        at wikokit.base.wikt.word.WRedirect.getRedirect(WRedirect.java:39)
        at wikokit.base.wikt.word.WordBase.<init>(WordBase.java:63)
        at wikt.parser.WiktParser.parseWiktionaryEntry(WiktParser.java:193)
        at wikt.parser.PageTableAll.parseAllPages(PageTableAll.java:183)
        at wikt.parser.Main.main(Main.java:101)


What version of the product are you using? On what operating system?

Latest SVN trunk.

Windows 8 64-bit, Java(TM) SE Runtime Environment (build 1.7.0_17-b02)

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 8 Apr 2013 at 6:29

GoogleCodeExporter avatar Mar 24 '15 13:03 GoogleCodeExporter

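I'm not a coder, but I tried adding a guard around the call:

Matcher m = ptrn_redirect.matcher(text);
if (m.find()) {
    try {
        return m.group(2);
    } catch (IndexOutOfBoundsException e) {
        // The pattern has no capturing group 2; treat this as "no redirect".
        // (A "finally { return null; }" here would discard the value returned in try.)
        return null;
    }
}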

Original comment by [email protected] on 8 Apr 2013 at 6:31
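
An alternative to catching the exception is to check `Matcher.groupCount()` before asking for the group. The sketch below is only an illustration: the class `RedirectGuard`, the helper `groupOrNull`, and the pattern `PTRN_REDIRECT` are hypothetical and are not taken from the wikokit sources, whose real `ptrn_redirect` may look different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RedirectGuard {
    // Hypothetical pattern for illustration only (NOT wikokit's actual ptrn_redirect).
    // It has exactly one capturing group: the redirect target.
    static final Pattern PTRN_REDIRECT =
            Pattern.compile("#REDIRECT\\s*\\[\\[(.+?)\\]\\]", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the text captured by the given group, or null when the text
     * does not match or the pattern has no capturing group with that index.
     */
    static String groupOrNull(Pattern pattern, String text, int group) {
        Matcher m = pattern.matcher(text);
        // groupCount() is the number of capturing groups in the pattern,
        // so this check avoids IndexOutOfBoundsException("No group N").
        if (group >= 0 && m.find() && group <= m.groupCount()) {
            return m.group(group);
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "#REDIRECT [[zona]]";
        System.out.println(groupOrNull(PTRN_REDIRECT, text, 1)); // prints "zona"
        System.out.println(groupOrNull(PTRN_REDIRECT, text, 2)); // prints "null"
    }
}
```

Since "No group 2" is raised when the pattern itself has fewer than two capturing groups, a check like this would only hide the symptom; the pattern used for the language in question probably still needs a look.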


Could you send the last several lines of the log file 
`enwikt20100824_parsed_06.log`?

It would be interesting to look at the problematic entry in the English Wiktionary.

Original comment by [email protected] on 8 Apr 2013 at 8:11


You can use the information in this log file to resume parsing.

E.g., if the last line in the log file is:
  2394000: z?na, duration: 575 min, remain: 60 min

then you should run the parser with the command:

java -cp ./dist/wikt_parser.jar;./dist/lib/mysql-connector-java-5.1.13-bin.jar;./dist/lib/common_wiki.jar -Xms1212m -Xmx1212m -Xmn16m -XX:+DisableExplicitGC wikt.parser.Main en 2394000 > enwikt20100824_parsed_07.log

That is, pass 2394000 instead of 0.

Original comment by [email protected] on 8 Apr 2013 at 8:17
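
For anyone who wants to script the resume step, here is a minimal sketch of pulling the page offset out of a progress line. It assumes the log lines look like the ones quoted in this thread; the class `ResumeOffset` and the method `offsetFromLogLine` are hypothetical names and are not part of wikokit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResumeOffset {
    // Assumed progress-line shape, based on the lines quoted in this thread:
    //   "2394000: z?na, duration: 575 min, remain: 60 min"
    // The optional ">>" allows for lines quoted with a leading ">>".
    static final Pattern PROGRESS_LINE =
            Pattern.compile("^\\s*(?:>>)?\\s*(\\d+):.*duration:.*remain:");

    /** Returns the page offset from a progress line, or 0 if the line is not one. */
    static long offsetFromLogLine(String line) {
        Matcher m = PROGRESS_LINE.matcher(line);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }

    public static void main(String[] args) {
        String last = "  2394000: z?na, duration: 575 min, remain: 60 min";
        // The value printed here is what would replace 0 on the command line.
        System.out.println(offsetFromLogLine(last)); // prints 2394000
    }
}
```

Falling back to 0 means an unrecognized line restarts parsing from the beginning, which in turn requires an empty parsed database, so a real script may prefer to stop with an error instead.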


Oh, sorry, I already restarted the app, so the log was overwritten.

Thanks for explaining how to resume parsing.

Original comment by [email protected] on 8 Apr 2013 at 8:34


If you restarted the app, do not forget to recreate the empty parsed 
Wiktionary database.
Good luck :)

Original comment by [email protected] on 8 Apr 2013 at 8:36


Oh? So I need to truncate all tables before restarting...
And something else is bothering me:
>>13000: Anne_Marie, duration: 156 min, remain: 39244 min


Original comment by [email protected] on 8 Apr 2013 at 8:40


Yes. 
1) If you *restart* parsing from zero, 
you need to delete, create, and source the enwikt_parsed database (in MySQL).

See http://code.google.com/p/wikokit/wiki/File_wikt_parsed_empty_sql

2) If you *resume* parsing from page NNNN, you do not need to delete anything.

Original comment by [email protected] on 8 Apr 2013 at 9:01


Ok, thanks.

Original comment by [email protected] on 8 Apr 2013 at 9:11
