cv-sentence-extractor
Will update instructions to use latest wikiextractor
Adjusted the instructions in the Readme to use a more recent version of wikiextractor. It seems to be able to extract more content. In my tests for Latvian, I get about 5% more sentences when the updated wikiextractor is used.
It is hard to tell whether the updated wikiextractor is meaningfully better. I noticed that it processes gallery templates in the articles better, so you get some sentences from those. There may be other improvements as well.
If someone follows along and does some tests for some other language, this is something to test. Will leave this PR as a note and will update it to a specific commit if I happen to do some further work on this.
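For anyone repeating the comparison for another language, here is a minimal sketch of how the difference could be measured. It assumes the extracted sentences were saved to two plain-text files with one sentence per line, one from each wikiextractor version. The file names and the helper functions `count_sentences` and `percent_gain` are hypothetical and not part of this repository.

```python
import sys


def count_sentences(path):
    """Count non-empty lines, assuming one extracted sentence per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def percent_gain(old_count, new_count):
    """Relative increase of new_count over old_count, in percent."""
    if old_count == 0:
        raise ValueError("old_count must be non-zero")
    return (new_count - old_count) / old_count * 100.0


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare.py old_output.txt new_output.txt
    old_count = count_sentences(sys.argv[1])
    new_count = count_sentences(sys.argv[2])
    print(f"old: {old_count}, new: {new_count}, "
          f"gain: {percent_gain(old_count, new_count):.1f}%")
```

This only compares raw counts. It does not say whether the extra sentences are good ones, so a manual look at a sample of the new sentences is still worthwhile.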