
WIP: Add rules for Swedish

Open andersjohansson opened this issue 5 years ago • 16 comments

Initial rules for extracting Swedish. Seems to give reasonable output already, albeit with unusual words here and there that could very well be filtered out with a blocklist of uncommon words.

Let’s try to generate one: /action blocklist sv 80

andersjohansson avatar Jul 06 '20 16:07 andersjohansson

The command doesn't seem to work in the initial description, since that doesn't count as a newly created issue comment. It should work here though:

/action blocklist sv 80

MichaelKohler avatar Jul 06 '20 17:07 MichaelKohler

Job started: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047

github-actions[bot] avatar Jul 06 '20 17:07 github-actions[bot]

One issue I did think about is that 14-word sentences in Swedish can be pretty long, since Swedish, like German, writes compounds as single words, which makes for fairly long words. One example from my extraction: “Kulturantropologer undersöker de processer som producerar, upprätthåller och förändrar kulturella beteendemönster, samhällsstrukturer och meningssystem.” That's 48 syllables.

Would it be reasonable to use a lower max word limit?
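To make the question concrete, here is a minimal sketch of a character cap applied alongside the word cap. The function name `within_limits` and the 100-character limit are assumptions for illustration only; they are not taken from the extractor's actual rules.

```python
def within_limits(sentence: str, max_words: int = 14, max_chars: int = 100) -> bool:
    """Return True if the sentence passes both the word cap and the character cap.

    max_chars is a hypothetical extra limit, not an existing extractor rule.
    """
    words = sentence.split()
    return len(words) <= max_words and len(sentence) <= max_chars


example = (
    "Kulturantropologer undersöker de processer som producerar, "
    "upprätthåller och förändrar kulturella beteendemönster, "
    "samhällsstrukturer och meningssystem."
)

# The example has exactly 14 words, so a word cap alone accepts it,
# while the (assumed) character cap rejects it.
print(len(example.split()), within_limits(example))
```

A character cap like this catches compound-heavy sentences that a pure word count lets through, at the cost of also rejecting some sentences that are long but easy to read.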

The particular example sentence would probably be filtered out by a blocklist with an 80-occurrence threshold ("meningssystem" occurs 3 times when I ripgrep through the wikiextracted text), but some potentially very long sentences could still be constructed from fairly common compound words.
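As a rough illustration of the frequency idea, the sketch below collects every word that occurs fewer than a given number of times in the extracted text. It assumes the threshold means "block words seen fewer than 80 times"; `build_blocklist` is a hypothetical helper, not the script that the `/action blocklist` command runs.

```python
import re
from collections import Counter


def build_blocklist(text: str, min_count: int = 80) -> list[str]:
    """Return the sorted words that occur fewer than min_count times in text."""
    # [^\W\d_]+ matches runs of Unicode letters, so å, ä and ö are kept.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count < min_count)


sample = "och och och meningssystem"
print(build_blocklist(sample, min_count=2))
```

Note that a check like this only looks at individual word frequency, so a sentence built entirely from common compounds would still pass.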

andersjohansson avatar Jul 06 '20 17:07 andersjohansson

That's definitely something to keep in mind while reviewing. How long does it take to say that sentence?

MichaelKohler avatar Jul 06 '20 17:07 MichaelKohler

About 8 seconds, timing myself. But I think it would be something like that for most people, if they don't stumble on the words, which is quite possible reading a sentence like that the first time. I'll keep that in mind for reviewing. Should the goal be for sentences to be fairly straightforward to say for most people, and not too long?

andersjohansson avatar Jul 06 '20 17:07 andersjohansson

I'd say around 8 seconds is fine. However, not every sentence should be that long, as recording could get quite exhausting after a while.

MichaelKohler avatar Jul 06 '20 18:07 MichaelKohler

Job finished: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047 Don't forget to download the artifacts.

github-actions[bot] avatar Jul 06 '20 20:07 github-actions[bot]

@andersjohansson you'll find the blocklist at the top right of the following link as posted by the previous comment: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/159592047

Anything I could help you with?

MichaelKohler avatar Jul 11 '20 17:07 MichaelKohler

That’s great! I’m away from my computer for a few weeks now so won’t be able to take it forward for a while.

andersjohansson avatar Jul 11 '20 18:07 andersjohansson

The sample extraction seems to result in an empty file? Could it be that all sentences are rejected for some reason or is there some other problem?

One problem I have noticed with Swedish Wikipedia is that it contains a massive number of bot-generated articles by Lsjbot (https://en.wikipedia.org/wiki/Lsjbot). This is fine for Wikipedia, but very few of these articles contain suitable sample sentences. Some examples: https://sv.wikipedia.org/wiki/Hillaby https://sv.wikipedia.org/wiki/Cyperus_pacificus

A lot of the words from these articles also contribute to the massive list of unusual words to block. Would it be possible to exclude these bot-articles in some way before extracting stuff?

andersjohansson avatar Jul 27 '20 16:07 andersjohansson

> The sample extraction seems to result in an empty file? Could it be that all sentences are rejected for some reason or is there some other problem?

There seems to have been an error downloading the WikiExtractor script. I've manually restarted the job, let's see if that helps.

> A lot of the words from these articles also contribute to the massive list of unusual words to block. Would it be possible to exclude these bot-articles in some way before extracting stuff?

I thought there was a discussion about that somewhere, but I can't find it. As far as I remember, this is not possible because the output of the WikiExtractor script doesn't include author information.

MichaelKohler avatar Jul 28 '20 17:07 MichaelKohler

> There seems to have been an error downloading the WikiExtractor script. I've manually restarted the job, let's see if that helps.

Looks like it doesn't. Will have a look tomorrow.

MichaelKohler avatar Jul 28 '20 17:07 MichaelKohler

/action blocklist sv 80

(ignore the output, this is for testing only)

MichaelKohler avatar Jul 29 '20 18:07 MichaelKohler

Job started: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/187520977

github-actions[bot] avatar Jul 29 '20 18:07 github-actions[bot]

@andersjohansson I think I have fixed the issue for now. If you merge master into your branch and push it, it should generate a new sample output.

MichaelKohler avatar Jul 29 '20 18:07 MichaelKohler

Job finished: https://github.com/Common-Voice/cv-sentence-extractor/actions/runs/187520977 Don't forget to download the artifacts.

github-actions[bot] avatar Jul 29 '20 21:07 github-actions[bot]