cv-sentence-extractor Adding Thai rules for CV Sentence Extractor

th.toml:

other_patterns: borrowed from https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/th.js (BEGIN_REGEX, END_REGEX, STRUCTURE_REGEX, and ABBREVIATION_REGEX, with a few adjustments)
replacements: borrowed from https://github.com/common-voice/sentence-collector/blob/main/server/lib/cleanup/languages/th.js (with some adjustments)
min_word_count and max_word_count are set by treating a "word" as "a group of characters between two whitespace or punctuation characters", since the extractor currently has no Thai word tokenization.
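To make that definition of "word" concrete, here is a minimal sketch of the counting it implies. It is only an illustration, not the extractor's own code: the function name `count_chunks` is hypothetical, and it assumes "punctuation" means ASCII punctuation only.

```python
import re
import string

# Assumption: separators are whitespace plus ASCII punctuation only.
_SEPARATORS = re.compile("[" + re.escape(string.punctuation) + r"\s]+")


def count_chunks(sentence: str) -> int:
    """Count runs of characters delimited by whitespace or punctuation."""
    return len([chunk for chunk in _SEPARATORS.split(sentence) if chunk])
```

With this definition a long Thai clause written without spaces counts as a single "word", which is why the limits in th.toml cannot be compared directly with those of space-delimited languages.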

This will close #133

How many sentences did you get at the end?

478

How did you create the blocklist file?

Since the current tokenizer does not work well with a language that does not use spaces as word delimiters, cvtools does not seem to work, so I haven't created one.

Review / Error ratio

(from 184 samples)

Category	%
OK	88
A: Spelling is not correct	1
B: Grammar is not correct	0
C: It's not easily speakable (including uncommon non-native words)	1
D: Other	10

"D" are mostly sentences with a "dangling word" at the beginning (a word that was meant to be the last word of the previous sentence).

Since the total number of sentences I have is just below 500, and the suggested random sample size is "100-500", I'm not sure whether the number of sentences I have is unexpectedly low.

I would like to clarify this before I ask more people to review.

(I may have to "relax" the rules, but I'm still not sure whether this is related to the way the punkt sentence tokenizer works.)

The extracted sentences are here: https://docs.google.com/spreadsheets/d/1pKBH_YQiO9ZdXIduvrb37HvCLlKBt8mGeDCpX8e8dT4/edit?usp=sharing

Questions

Does the number of articles in the original Wikipedia also affect the number of extracted sentences?

I tried to extract all the articles, without applying any rules, with this command:

cargo run -- extract -l th -d ../wikiextractor/text/ --no_check >> wiki.th.all.txt

Got this

$ wc -l wiki.th.*
 1985699 wiki.th.all.txt
     478 wiki.th.txt

We actually have a lot of lines extracted in wiki.th.all.txt (1,314,274 lines after blank lines are removed), but it looks like these "sentences" tend to be very long. In fact, a lot of lines contain more than one sentence (some are a whole paragraph).

And the longer the line/sentence is, the more likely it is to be hit by one of the disallowing rules.
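A rough way to check how long the lines really are is sketched below. This is only an assumption-laden helper for inspecting the output file: the name `summarize_lines` and the `max_chars` threshold are hypothetical and do not correspond to any extractor setting.

```python
from typing import Iterable, Tuple


def summarize_lines(lines: Iterable[str], max_chars: int = 100) -> Tuple[int, int]:
    """Return (number of non-blank lines, number of those longer than max_chars)."""
    non_blank = [line.strip() for line in lines if line.strip()]
    too_long = sum(1 for line in non_blank if len(line) > max_chars)
    return len(non_blank), too_long
```

It could be run on the unfiltered output by passing an open file handle for wiki.th.all.txt (opened with encoding="utf-8") as `lines`.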

A few sample lines from wiki.th.all.txt (no rules applied):

ดาราศาสตร์เป็นหนึ่งในสาขาของวิทยาศาสตร์ที่เก่าแก่ที่สุด นักดาราศาสตร์ในวัฒนธรรมโบราณสังเกตการณ์ดวงดาวบนท้องฟ้าในเวลากลางคืน และวัตถุทางดาราศาสตร์หลายอย่างก็ได้ถูกค้นพบเรื่อยมาตามยุคสมัย อย่างไรก็ตาม กล้องโทรทรรศน์เป็นสิ่งประดิษฐ์ที่จำเป็นก่อนที่จะมีการพัฒนามาเป็นวิทยาศาสตร์สมัยใหม่ ตั้งแต่อดีตกาล ดาราศาสตร์ประกอบไปด้วสาขาที่หลากหลายเช่น การวัดตำแหน่งดาว การเดินเรือดาราศาสตร์ ดาราศาสตร์เชิงสังเกตการณ์ การสร้างปฏิทิน และรวมทั้งโหราศาสตร์ แต่ดาราศาสตร์ทุกวันนี้ถูกจัดว่ามีความหมายเหมือนกับฟิสิกส์ดาราศาสตร์ ตั้งแต่คริสต์ศตวรรษที่ 20 เป็นต้นมา ดาราศาสตร์ได้แบ่งออกเป็นสองสาขาได้แก่ ดาราศาสตร์เชิงสังเกตการณ์ และดาราศาสตร์เชิงทฤษฎี ดาราศาสตร์เชิงสังเกตการณ์จะให้ความสำคัญไปที่การเก็บและการวิเคราะห์ข้อมูล โดยการใช้ความรู้ทางกายภาพเบื้องต้นเป็นหลัก ส่วนดาราศาสตร์เชิงทฤษฎีให้ความสำคัญไปที่การพัฒนาคอมพิวเตอร์หรือแบบจำลองเชิงวิเคราะห์ เพื่ออธิบายวัตถุท้องฟ้าและปรากฏการณ์ต่าง ๆ ทั้งสองสาขานี้เป็นองค์ประกอบซึ่งกันและกัน กล่าวคือ ดาราศาสตร์เชิงทฤษฎีใช้อธิบายผลจากการสังเกตการณ์ และดาราศาสตร์เชิงสังเกตการณ์ใช้ในการรับรองผลจากทางทฤษฎี
เมื่อสังคมมีวิวัฒนาการขึ้นในดินแดนต่าง ๆ การสังเกตการณ์ทางดาราศาสตร์ก็ซับซ้อนมากขึ้น โดยเฉพาะอย่างยิ่งใน เมโสโปเตเมีย กรีก จีน อียิปต์ อินเดีย และ มายา เริ่มมีแนวคิดเกี่ยวกับความสัมพันธ์ของธรรมชาติแห่งจักรวาลกว้างขวางขึ้น ผลการศึกษาดาราศาสตร์ในยุคแรก ๆ จะเป็นการบันทึกแผนที่ตำแหน่งของดวงดาวต่าง ๆ อันเป็นศาสตร์ที่ปัจจุบันเรียกกันว่า การวัดตำแหน่งดาว (astrometry) ผลจากการเฝ้าสังเกตการณ์ทำให้แนวคิดเกี่ยวกับการเคลื่อนที่ของดวงดาวต่าง ๆ เริ่มก่อตัวเป็นรูปร่างขึ้น ธรรมชาติการเคลื่อนที่ของดวงอาทิตย์ ดวงจันทร์ และโลก นำไปสู่แนวคิดเชิงปรัชญาเพื่อพยายามอธิบายปรากฏการณ์เหล่านั้น ความเชื่อดั้งเดิมคือโลกเป็นศูนย์กลางของจักรวาล โดยมีดวงอาทิตย์ ดวงจันทร์ และดวงดาวต่าง ๆ เคลื่อนที่ไปโดยรอบ แนวคิดนี้เรียกว่า แบบจำลองแบบโลกเป็นศูนย์กลางจักรวาล (geocentric model)
เคปเลอร์ได้คิดค้นระบบแบบใหม่ขึ้นโดยปรับปรุงจากแบบจำลองเดิมของโคเปอร์นิคัส ทำให้รายละเอียดการโคจรต่าง ๆ ของดาวเคราะห์และดวงอาทิตย์ที่ศูนย์กลางสมบูรณ์ถูกต้องมากยิ่งขึ้น แต่เคปเลอร์ก็ไม่ประสบความสำเร็จในการนำเสนอทฤษฎีนี้เนื่องจากกฎหมายในยุคสมัยนั้น จนกระทั่งต่อมาถึงยุคสมัยของเซอร์ ไอแซค นิวตัน ผู้คิดค้นหลักกลศาสตร์ท้องฟ้าและกฎแรงโน้มถ่วงซึ่งสามารถอธิบายการเคลื่อนที่ของดาวเคราะห์ได้อย่างสมบูรณ์ นิวตันยังได้คิดค้นกล้องโทรทรรศน์แบบสะท้อนแสงขึ้นด้วย
ไม่ควรสับสนระหว่างดาราศาสตร์โบราณกับโหราศาสตร์ ซึ่งเป็นความเชื่อที่นำเอาเหตุการณ์และพฤติกรรมของมนุษย์ไปเกี่ยวโยงกับตำแหน่งของวัตถุท้องฟ้า แม้ว่าทั้งดาราศาสตร์และโหราศาสตร์เกิดมาจากจุดร่วมเดียวกัน และมีส่วนหนึ่งของวิธีการศึกษาที่เหมือนกัน เช่นการบันทึกตำแหน่งดาว (ephemeris) แต่ทั้งสองอย่างก็แตกต่างกัน

I guess if we can make the lines shorter, we can get more extracted sentences in wiki.th.txt

Need some suggestions here. Thank you.

Apr 15 '21 03:04 bact

What is the general recommendation for numbers (0-9), btw?

I see that languages like en and de allow them, but a language like ka doesn't.

Apr 15 '21 04:04 bact

Thanks for your efforts here. This perfectly well shows how broken the sentence segmentation is for some languages :( There's #11 already on file for this issue. I've also created a discussion/proposal at https://discourse.mozilla.org/t/future-of-the-sentence-extractor-your-input-is-required/78139 .

Apr 15 '21 17:04 MichaelKohler

Reviewed 184 samples from the current extracted sentences, got "OK" for 88%.

The rest of the errors are mostly due to a "dangling word": a word that was meant to be the first/last word of the next/previous sentence, but was incorrectly included in the sentence in question (probably because of a space).

I updated the first comment with the error table.

Apr 16 '21 05:04 bact

Continuing from the discussion in https://github.com/common-voice/cv-sentence-extractor/issues/139#issuecomment-821964021 , I'm thinking of one possible way to extract Thai sentences and guarantee the 3-sentence limit.

A sentence splitter could work on the JSON files inside wikiextractor/text (created by WikiExtractor.py).

The sentence splitter will read the text value from each JSON object inside those files and insert newline characters to assist the sentence extraction (done later by Common Voice's Sentence Extractor).

I will try to build a prototype of this. If it succeeds, it will work on top of the current pipeline:

1-Get dump -> 2-Extract dump -> 3-Extract sentences

and expand it to

1-Get dump -> 2-Extract dump -> 3-Split sentences -> 4-Extract sentences.
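A minimal sketch of what step 3 might look like is below. It assumes each line of a WikiExtractor output file is one JSON object with a "text" field; the names `split_article_line` and `placeholder_splitter` are hypothetical, and the placeholder simply splits on spaces, standing in for a real Thai sentence splitter.

```python
import json
from typing import Callable, List


def split_article_line(json_line: str,
                       split_sentences: Callable[[str], List[str]]) -> str:
    """Rewrite one JSON line so its "text" has one candidate sentence per line."""
    article = json.loads(json_line)
    sentences = split_sentences(article.get("text", ""))
    article["text"] = "\n".join(s.strip() for s in sentences if s.strip())
    # ensure_ascii=False keeps Thai characters readable in the output file.
    return json.dumps(article, ensure_ascii=False)


def placeholder_splitter(text: str) -> List[str]:
    # Assumption: stand-in only; a real Thai sentence splitter would go here.
    return text.split(" ")
```

Passing the splitter in as a function keeps the JSON handling separate from the language-specific part, so a different splitter can be swapped in without touching the rest of the step.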

Jun 09 '21 18:06 bact

@bact I've created a proof of concept that uses a Python-based sentence splitting algorithm, to make sure that the Sentence Extractor can also be used for languages that rust-punkt does not support. I've created a PR and would like your input on whether it's clear what you would need to do when reading the README. Happy to hear your feedback on https://github.com/common-voice/cv-sentence-extractor/pull/150/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R233 :)

Jul 17 '21 21:07 MichaelKohler

The segmenter PR has now been merged; check out https://github.com/common-voice/cv-sentence-extractor#using-a-different-segmenter-to-split-sentences for more info. Looking forward to hearing whether that helps with Thai :)

Jul 18 '21 12:07 MichaelKohler

Thank you @MichaelKohler . The new segmenter option is welcome. I think this will make the pipeline more standardized, even with different language-specific processors. I will take a closer look at this.

Aug 08 '21 11:08 bact

I initially thought that crfcut might work for this, but after several tries and inspections of the split text, some of the output starts or ends with an ill-formed word, very likely because the text got segmented at an invalid point (like before a following vowel: ก|า ).

I'm currently trying to see whether I can write a wrapper to post-process the output from crfcut, or whether there is any other alternative.
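One possible shape for such a wrapper is sketched below. It is an assumption, not something crfcut provides: it takes an already-split list of segments and glues a segment back onto the previous one when it starts with a character that cannot begin a Thai word (following vowels, above/below vowels, tone marks and similar signs). The function names and the exact code point ranges are my own choice.

```python
from typing import List


def _cannot_start_word(ch: str) -> bool:
    """True for Thai following vowels, combining vowels and tone marks."""
    cp = ord(ch)
    return (
        0x0E30 <= cp <= 0x0E3A    # sara a .. phinthu (includes sara aa, sara am)
        or cp == 0x0E45           # lakkhangyao
        or 0x0E47 <= cp <= 0x0E4E # maitaikhu, tone marks, thanthakhat, etc.
    )


def merge_broken_segments(segments: List[str]) -> List[str]:
    """Join a segment to the previous one if it starts at an invalid point."""
    merged: List[str] = []
    for segment in segments:
        if merged and segment and _cannot_start_word(segment[0]):
            merged[-1] += segment
        else:
            merged.append(segment)
    return merged
```

This only repairs splits that land before a non-initial character such as ก|า; it does not detect other kinds of bad boundaries, such as a split in the middle of a word that happens to fall before a consonant.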

Aug 08 '21 11:08 bact
