cv-sentence-extractor
Added Hindi language TOML and wiki sample
> How many sentences did you get at the end?

4,500 lines in the output.

> How did you create the blacklist file?

By removing all characters from the English language.
Review: please use the sample file `wiki.hi.txt`.
Also, can you remove the sample file and host it somewhere online instead? We eventually don't want it as part of the source code here.
Thanks, Michael. Responses to your questions below.
> Are there any Hindi script specific symbols we might not want? (I have no idea about Hindi)

None. We want all Hindi symbols included.
> Are there Hindi specific abbreviation patterns?

Nothing different for Hindi.
> Did you check if some of the newer rules might be helpful, such as `even_symbols` or `replacements`?

Thanks. Yes, I have now included a replacement rule. It replaces Hindi's period symbol "।" (which marks the end of a Hindi sentence) with the standard period "." symbol. I need this replacer to run before the `SentenceTokenizer` in `extractor.rs` so that each piece of text is broken into sentences correctly before the rules are checked. That is why there is a slight code change requested in `extractor.rs` at line 108. Can you please check if this is OK?
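As a rough sketch, the replacement rule described above might look like this in `src/rules/hi.toml`. The key name and the list-of-pairs format are assumptions here and should be verified against the project's rules documentation:

```toml
# Sketch of a replacement rule for src/rules/hi.toml.
# Assumption: replacements is a list of [from, to] pairs -- check the
# exact format against the cv-sentence-extractor rules documentation.
# Replacing the Devanagari danda "।" with "." lets the sentence
# tokenizer split the text correctly before the other rules run.
replacements = [
  ["।", "."],
]
```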
> Did you run the blacklist generation script as referenced in the README? For other languages, disallowing less often used words greatly increased the quality, as we could remove rarely used foreign words and foreign names.

Thanks. Yes, I have included a long list of less frequently occurring words in the `disallowed_words/hindi.txt` file. The list contains about 150K words.
> How many sentences did you get in total? I assume 4500 is just for the review?

We get around 90K sentences. This is with `max_sentences_per_text=3`. I also tried `max_sentences_per_text=50` and we get a 10X larger set, which is also good. How do I make this possible via config?
Thanks.
Thanks for your answers!
> Yes, I have included a long list of less frequently occurring words in the disallowed_words/hindi.txt file. The list contains about 150K words.
Which limit did you choose?
> We get around 90K sentences. This is with max_sentences_per_text=3. I also tried max_sentences_per_text=50 and we get a 10X larger set, which is also good. How do I make this possible via config?
A maximum of 3 sentences per article is a legal requirement, we can't go higher than that.
Can I also ask you to do the following, to make sure you can profit from the automatic sample extraction we just introduced?
- Update your branch with the latest code from the `master` branch
- Rename `src/rules/hindi.toml` to `src/rules/hi.toml`
- Rename `src/rules/disallowed_words/hindi.toml` to `src/rules/disallowed_words/hi.toml`
Also note that the local command for extraction will now be:
`cargo run -- extract -l hi -d path/to/files`
Happy to answer any question you may have and thanks for your efforts!
I'll comment on the change in extractor.rs and some other things separately.
@karthiksibm can you please also have a look at the other comments I've made?
I've made the updates to hi.toml. Thanks for your comments.
Error Rate Review:
- Reviewer 1 - error rate: 10% - https://docs.google.com/spreadsheets/d/1WoGyQH4ZW9f_N4FhHEOEdHQ4XoB0r2NAzgETSGPdFO4/edit#gid=0
- Reviewer 2 - error rate: 12% - https://docs.google.com/spreadsheets/d/1WYwPogPW3BRh3BYpoVquK-CGRod3HreHrtfVZsl2vwY/edit#gid=0
- Reviewer 3 - error rate: 20% - https://docs.google.com/spreadsheets/d/1Rpf6JC5QqiNwBJWPnCRqf3Hi1sRzJIHEoENbE3IBsZw/edit#gid=0
- Reviewer 4 - error rate: 26% - https://docs.google.com/spreadsheets/d/1ByQ5o3wtE7tm1ieedC9IFPM0a6Y0K1p-KFgDj-B2uuU/edit#gid=0
These numbers are a bit too high. @nukeador I forgot what the required minimum was, can you remind me?
Can you look at the sentences and see if you can
- identify common words that could be added to the blacklist?
- consider decreasing the minimum frequency for the blacklist?
- find any other common wrong patterns that could be added to the rules? (you can also use the abbreviation pattern for other stuff, check the German (de) one for examples)
Thanks for your efforts!
The error rate should be between 5-7%. Anything lower of course is great, but probably very hard to achieve.
Thanks @MichaelKohler. It looks like there are too many complicated, long words, which make the sentences hard to pronounce.
To filter out such long words, is there a parameter to set the maximum characters per word, such as a `max_characters` or `max_trimmed_length`, i.e. the opposite of the `min_characters` or `min_trimmed_length` that we have?
Meanwhile, I will play around to try and catch them into a better blacklist words set.
> To filter out such long words, is there a parameter to set the max_characters per word or max_trimmed_length, like the opposite of min_characters or min_trimmed_length that we have?
There is currently no such setting, but you could use a Regex in the abbreviations_patterns section to filter those out. I'll have a quick look if I can come up with a regex.
@karthiksibm it seems you can use `\\w{5,50}` in the `abbreviation_patterns` to exclude any words longer than 4 characters. Adjust 5 to the minimum word length that should be excluded. Would be great if you could add a comment in the file explaining that, otherwise we might wonder in the future why that is :)
If you merge the latest master into your branch, you can also use the `other_patterns` config rule to add that; then it's not so confusing, as it's not really an abbreviation pattern.
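For illustration, the word-length filter could be sketched like this in `src/rules/hi.toml`. The key name is taken from the discussion above; treat the exact syntax as an assumption to verify against the current rules documentation:

```toml
# Sketch for src/rules/hi.toml -- rejects sentences containing any word
# of 5 or more characters. Adjust 5 to the minimum word length that
# should be excluded. Note: this is not really an abbreviation; we only
# reuse the pattern mechanism to drop overly long words.
abbreviation_patterns = [
  "\\w{5,50}",
]
```

On a branch that includes the newer `other_patterns` rule, the same regex could live there instead, which makes the intent clearer.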
> Looks like it has too many complicated, long words which make them hard to pronounce.
…but are those words still appearing with a high frequency? If not, increasing the minimum frequency of the blacklist might also be a way to go.
Error Rate Review:
- Reviewer 1 - error rate: 3% - https://docs.google.com/spreadsheets/d/1l_5bP01ggRbVcwBosBcTCT0zkB6cABPTsnQzKAJdlJU/edit?usp=sharing
- Reviewer 2 - error rate: 8% - https://docs.google.com/spreadsheets/d/1xiWdSXmPOWdxTYnPuGLvp4s3CiRWoYq3BBaqMwO_Dug/edit?usp=sharing
- Reviewer 3 - error rate: 8% - https://docs.google.com/spreadsheets/d/1FEBKn2Z3jEr93jGdpkz3kLxfzXQL1ClTUwU8amraHoo/edit?usp=sharing
It looks better now after improving the blacklist.
Thanks, that looks better. How did you improve the blacklist? Which maximum frequency were you using before, and how much now? Also, does this PR include the latest list? I'm not seeing a recent commit.
I just saw the following sentence:

> राज घाट, नई दिल्ली, में गांधी जी के स्मारक पर "देवनागरी में " हे राम " लिखा हुआ है.
You might want to have a look at the `even_symbols` config. With that, this should be easy to catch.
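A minimal sketch of that rule, assuming `even_symbols` takes a list of characters that must occur an even number of times in a sentence (verify the exact format against the project's rules documentation):

```toml
# Sketch for src/rules/hi.toml -- reject sentences where the listed
# symbols appear an odd number of times, e.g. an unmatched quotation
# mark like the one in the sentence above.
even_symbols = ["\""]
```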
@karthiksibm did you have the chance to check the last question here? Is this still producing just 4500 sentences?
Feel free to join our matrix chat so we can support you to get more sentences, my understanding is that Hindi wikipedia has 180K articles, it's weird you are only getting 90K sentences.
https://chat.mozilla.org/#/room/#common-voice-sentence-extractor:mozilla.org
> @karthiksibm did you have the chance to check the last question here? Is this still just producing 4500 sentences?
Check https://github.com/Common-Voice/cv-sentence-extractor/pull/89#issuecomment-594431305 where the answer is "We get around 90K sentences."
However, the following questions should still be answered before we proceed here. I'm mostly worried about not having a recent commit for the blacklist change.
> How did you improve the blacklist? Which maximum frequency were you using before, and how much now? Also, does this PR include the latest list? I'm not seeing a recent commit.
@MichaelKohler sorry I got busy with other projects. I'll get back quickly with the answer to your question.
@MichaelKohler checked in the latest rules and the blacklist file. The blacklist was generated with a frequency threshold of 50, also including words longer than 9 characters. That resulted in improved readability.
@MichaelKohler