cv-sentence-extractor Initial Polish language rules and blocklist

Open J-Wrobel opened this issue 4 years ago • 3 comments

How many sentences did you get at the end?

Sentences remaining after applying the blocklist: 214838

How did you create the blocklist file?

The blocklist is based on word frequency with a threshold of 40, plus an English word list taken from the wordlist repo by dwyl.
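A minimal sketch of how a frequency-based blocklist like this could be built. It assumes that words seen fewer than 40 times are the ones that get blocked; the function name `build_blocklist`, the tokenizing regex and the `extra_words` parameter are hypothetical and not taken from the actual scripts used here.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of Unicode letters (digits and underscores excluded).
WORD_RE = re.compile(r"[^\W\d_]+")


def build_blocklist(sentences, threshold=40, extra_words=()):
    """Return a sorted list of words to block.

    Assumption: a word is blocked when it occurs fewer than `threshold`
    times in the corpus. `extra_words` stands in for an external word
    list (for example English words) that is blocked unconditionally.
    """
    counts = Counter(
        word.lower()
        for sentence in sentences
        for word in WORD_RE.findall(sentence)
    )
    rare = {word for word, n in counts.items() if n < threshold}
    return sorted(rare | {word.lower() for word in extra_words})
```

With a real corpus the call would look like `build_blocklist(sentences, threshold=40, extra_words=english_words)`, where `english_words` is loaded from whatever word list is in use.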

After a few iterations, my estimated correctness based on a 400-sentence sample is ~96%. It could probably be better in reality, since I think I marked sentences with hard-to-read words quite conservatively. Most problems are with surnames and place names, but I don't have a reasonably complete way of filtering them at the moment. I'm open to additional suggestions.
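To give a rough idea of how much uncertainty a 400-sentence sample carries, here is a small sketch that computes a 95% Wilson score interval for the observed rate. The count of 384 correct sentences is an assumption derived from "~96% of 400", not a figure from the review spreadsheet, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Assumption: ~96% of 400 reviewed sentences means 384 were marked OK.
low, high = wilson_interval(384, 400)
print(f"{low:.3f} .. {high:.3f}")
```

Under these assumptions the interval spans roughly 94% to 97.5%, so a sample of this size cannot pin the true rate down much more tightly than a couple of percentage points.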

Link to review spreadsheet with 400 sentences reviewed by me as Reviewer1: link

Additional sentence selection and filtration info:
- removed most abbreviations that could be misspelled or read in the wrong form by the reader
- based on Scarfmonster's initial config file, I kept the replacements for centuries like XII w.
- replaced some abbreviations with their full form when one was available
- filtered other patterns, such as 2-4 letter sequences uncommon in Polish words like qu, sh, łł, heim
- left x, v and q as allowed letters (open to discussion)
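The pattern filter and the allowed-letter rule from the list above can be sketched roughly as follows. This is only an illustration in Python, not the actual rules file: the names `UNCOMMON`, `ALLOWED` and `keep_sentence` are hypothetical, and the set of permitted punctuation is an assumption.

```python
import re

# Letter sequences named above as uncommon in Polish words.
UNCOMMON = re.compile(r"qu|sh|łł|heim", re.IGNORECASE)

# Polish letters plus x, v and q (all covered by a-z and the diacritics).
# Assumption: only spaces and a few basic punctuation marks are permitted.
ALLOWED = re.compile(r"^[a-ząćęłńóśźż ,.?!-]+$", re.IGNORECASE)


def keep_sentence(sentence):
    """Keep a sentence only if every character is allowed and no uncommon
    letter sequence appears anywhere in it."""
    return bool(ALLOWED.match(sentence)) and not UNCOMMON.search(sentence)


for s in ["Ala ma kota.", "Pan Schmidt mieszka w Mannheim.", "To kosztuje 5 zł."]:
    print(keep_sentence(s), s)
```

Note that a plain substring match like this also rejects legitimate words that happen to contain one of the sequences, which is the trade-off of a conservative filter.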

I used a Python segmenter with punkt, which in trials gave better results than the Rust implementation (a topic started by user Scarfmonster some time ago), and added some abbreviation tweaks found in gists (marked in the source).

Nov 10 '21 20:11 J-Wrobel

So far, one other Polish contributor and I have done the review, and we estimated 96% and 94% OK sentences from a random sample of 400. I'll try to get at least one more reviewer, but I don't have high hopes.

Nov 20 '21 15:11 J-Wrobel

@J-Wrobel did you have any luck getting a third reviewer?

Jan 08 '22 13:01 MichaelKohler

No, I tried to get someone to look at it but had no luck, despite some initial interest. Everyone seems busy :)

Jan 08 '22 13:01 J-Wrobel
