cv-sentence-extractor
Added Hindi language TOML and wiki sample
> How many sentences did you get at the end?

4,500 lines in the output.

> How did you create the blacklist file?

By removing all characters from the English language.
Review: please use the sample file `wiki.hi.txt`.
Also, can you remove the sample file and host it somewhere online instead? We eventually don't want it as part of the source code here.
Thanks, Michael. Responses to your questions below.
> Are there any Hindi script specific symbols we might not want? (I have no idea about Hindi)

None. We want all Hindi symbols included.
> Are there Hindi specific abbreviation patterns?

Nothing different for Hindi.
> Did you check if some of the newer rules might be helpful, such as `even_symbols` or `replacements`?

Thanks. Yes, I have now included a replacement rule. It replaces Hindi's period symbol "।" (which marks the end of a Hindi sentence) with the standard period "." symbol. I need this replacer to run before the `SentenceTokenizer` in `extractor.rs` so that each piece of text is broken into sentences correctly before the rules are checked. That is why there is a slight code change requested in `extractor.rs` at line 108. Can you please check if this is OK?
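As a rough sketch, the replacement rule described above might look like this in `src/rules/hi.toml`. The key name and the list-of-pairs format are assumptions here and should be verified against the project's rules documentation:

```toml
# Sketch of a replacement rule for src/rules/hi.toml.
# Assumption: replacements is a list of [from, to] pairs -- check the
# exact format against the cv-sentence-extractor rules documentation.
# Replacing the Devanagari danda "।" with "." lets the sentence
# tokenizer split the text correctly before the other rules run.
replacements = [
  ["।", "."],
]
```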
> Did you run the blacklist generation script as referenced in the README? For other languages, disallowing less often used words greatly increased the quality, as we could remove rarely used foreign words and foreign names.

Thanks. Yes, I have included a long list of less frequently occurring words in the `disallowed_words/hindi.txt` file. The list contains about 150K words.
> How many sentences did you get in total? I assume 4500 is just for the review?

We get around 90K sentences. This is with `max_sentences_per_text=3`. I also tried `max_sentences_per_text=50` and we get a 10X larger set, which is also good. How do I make this possible via config?
Thanks.
Thanks for your answers!
> Yes, I have included a long list of less frequently occurring words in the disallowed_words/hindi.txt file. The list contains about 150K words.
Which limit did you choose?
> We get around 90K sentences. This is with max_sentences_per_text=3. I also tried max_sentences_per_text=50 and we get a 10X larger set, which is also good. How do I make this possible via config?
A maximum of 3 sentences per article is a legal requirement, we can't go higher than that.
Can I also ask you to do the following, to make sure you can profit from the automatic sample extraction we just introduced?
- Update your branch with the latest code from the `master` branch
- Rename `src/rules/hindi.toml` to `src/rules/hi.toml`
- Rename `src/rules/disallowed_words/hindi.toml` to `src/rules/disallowed_words/hi.toml`
Also note that the local command for extraction will now be:
`cargo run -- extract -l hi -d path/to/files`
Happy to answer any question you may have and thanks for your efforts!
I'll comment on the change in extractor.rs and some other things separately.
@karthiksibm can you please also have a look at the other comments I've made?
I've made the updates to hi.toml. Thanks for your comments.
Error Rate Review:
- Reviewer 1 - error rate: 10% - https://docs.google.com/spreadsheets/d/1WoGyQH4ZW9f_N4FhHEOEdHQ4XoB0r2NAzgETSGPdFO4/edit#gid=0
- Reviewer 2 - error rate: 12% - https://docs.google.com/spreadsheets/d/1WYwPogPW3BRh3BYpoVquK-CGRod3HreHrtfVZsl2vwY/edit#gid=0
- Reviewer 3 - error rate: 20% - https://docs.google.com/spreadsheets/d/1Rpf6JC5QqiNwBJWPnCRqf3Hi1sRzJIHEoENbE3IBsZw/edit#gid=0
- Reviewer 4 - error rate: 26% - https://docs.google.com/spreadsheets/d/1ByQ5o3wtE7tm1ieedC9IFPM0a6Y0K1p-KFgDj-B2uuU/edit#gid=0
These numbers are a bit too high. @nukeador I forgot what the required minimum was, can you remind me?
Can you look at the sentences and see if you can
- identify common words that could be added to the blacklist?
- consider decreasing the minimum frequency for the blacklist?
- find any other common wrong patterns that could be added to the rules? (you can also use the abbreviation pattern for other stuff, check the German (de) one for examples)
Thanks for your efforts!
The error rate should be between 5-7%. Anything lower of course is great, but probably very hard to achieve.
Thanks @MichaelKohler. It looks like there are too many complicated, long words, which make the sentences hard to pronounce.
To filter out such long words, is there a parameter to set the maximum characters per word, such as a `max_characters` or `max_trimmed_length`, i.e. the opposite of the `min_characters` or `min_trimmed_length` that we have?
Meanwhile, I will play around to try and catch them into a better blacklist words set.
> To filter out such long words, is there a parameter to set the max_characters per word or max_trimmed_length, like the opposite of min_characters or min_trimmed_length that we have?
There is currently no such setting, but you could use a Regex in the abbreviations_patterns section to filter those out. I'll have a quick look if I can come up with a regex.
@karthiksibm it seems you can use `\\w{5,50}` in the `abbreviation_patterns` to exclude any words longer than 4 characters. Adjust 5 to the minimum word length that should be excluded. Would be great if you could add a comment in the file explaining that, otherwise we might wonder in the future why that is :)
If you merge the latest master into your branch, you can also use the `other_patterns` config rule to add that; then it's not so confusing, as it's not really an abbreviation pattern.
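For illustration, the word-length filter could be sketched like this in `src/rules/hi.toml`. The key name is taken from the discussion above; treat the exact syntax as an assumption to verify against the current rules documentation:

```toml
# Sketch for src/rules/hi.toml -- rejects sentences containing any word
# of 5 or more characters. Adjust 5 to the minimum word length that
# should be excluded. Note: this is not really an abbreviation; we only
# reuse the pattern mechanism to drop overly long words.
abbreviation_patterns = [
  "\\w{5,50}",
]
```

On a branch that includes the newer `other_patterns` rule, the same regex could live there instead, which makes the intent clearer.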
> Looks like it has too many complicated, long words which make them hard to pronounce.
…but are those words still appearing with a high frequency? If not, increasing the minimum frequency of the blacklist might also be a way to go.
Error Rate Review:
- Reviewer 1 - error rate: 3% - https://docs.google.com/spreadsheets/d/1l_5bP01ggRbVcwBosBcTCT0zkB6cABPTsnQzKAJdlJU/edit?usp=sharing
- Reviewer 2 - error rate: 8% - https://docs.google.com/spreadsheets/d/1xiWdSXmPOWdxTYnPuGLvp4s3CiRWoYq3BBaqMwO_Dug/edit?usp=sharing
- Reviewer 3 - error rate: 8% - https://docs.google.com/spreadsheets/d/1FEBKn2Z3jEr93jGdpkz3kLxfzXQL1ClTUwU8amraHoo/edit?usp=sharing
It looks better now after improving the blacklist.
Thanks, that looks better. How did you improve the blacklist? Which maximum frequency were you using before, and how much now? Also, does this PR include the latest list? I'm not seeing a recent commit.
I just saw the following sentence:

> राज घाट, नई दिल्ली, में गांधी जी के स्मारक पर "देवनागरी में " हे राम " लिखा हुआ है.
You might want to have a look at the `even_symbols` config. With that, this should be easy to catch.
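A minimal sketch of that rule, assuming `even_symbols` takes a list of characters that must occur an even number of times in a sentence (verify the exact format against the project's rules documentation):

```toml
# Sketch for src/rules/hi.toml -- reject sentences where the listed
# symbols appear an odd number of times, e.g. an unmatched quotation
# mark like the one in the sentence above.
even_symbols = ["\""]
```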
@karthiksibm did you have the chance to check the last question here? Is this still producing just 4500 sentences?
Feel free to join our matrix chat so we can support you to get more sentences, my understanding is that Hindi wikipedia has 180K articles, it's weird you are only getting 90K sentences.
https://chat.mozilla.org/#/room/#common-voice-sentence-extractor:mozilla.org
> @karthiksibm did you have the chance to check the last question here? Is this still just producing 4500 sentences?
Check https://github.com/Common-Voice/cv-sentence-extractor/pull/89#issuecomment-594431305 where the answer is "We get around 90K sentences."
However, the following questions should still be answered before we proceed here. I'm mostly worried about not having a recent commit for the blacklist change.
> How did you improve the blacklist? Which maximum frequency were you using before, and how much now? Also, does this PR include the latest list? I'm not seeing a recent commit.
@MichaelKohler sorry I got busy with other projects. I'll get back quickly with the answer to your question.
@MichaelKohler checked in the latest rules and the blacklist file. The blacklist was generated with a frequency threshold of 50, also including words longer than 9 characters. That resulted in improved readability.
@MichaelKohler