cv-sentence-extractor
Initial Polish language rules and blocklist
How many sentences did you get at the end?
Sentences remaining after applying the blocklist: 214838
How did you create the blocklist file?
The blocklist is based on word frequency with a threshold of 40, plus an English word list taken from the wordlist repo by dwyl.
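For anyone wanting to reproduce something similar, here is a minimal sketch of a frequency-based blocklist. It is not the script used for this PR: the function name `build_blocklist`, the tokenizing regex, and the rule that words occurring fewer than `threshold` times get blocked are all assumptions made for illustration.

```python
import re
from collections import Counter

# Runs of Unicode letters (no digits, no underscore); an assumed tokenizer.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_blocklist(sentences, threshold=40, extra_words=()):
    """Return a sorted blocklist of rare words plus extra words.

    Assumption: a word is blocked when it occurs fewer than
    `threshold` times across all sentences.
    """
    counts = Counter(
        word.lower()
        for sentence in sentences
        for word in WORD_RE.findall(sentence)
    )
    rare = {word for word, count in counts.items() if count < threshold}
    # extra_words stands in for an external list, e.g. English words.
    return sorted(rare | {word.lower() for word in extra_words})
```

Calling `build_blocklist(sentences, threshold=40, extra_words=english_words)` would then give one list that merges the rare words with the English word list, where `english_words` is whatever was loaded from the external file.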
After a few iterations, my estimated correctness based on a 400-sentence sample is ~96%. It is probably better in reality, since I think I marked sentences with hard-to-read words quite conservatively. Most problems are with surnames and place names, but I don't have a reasonably complete way of filtering them at the moment. I'm open to additional suggestions.
Link to review spreadsheet with 400 sentences reviewed by me as Reviewer1: link
Additional sentence selection and filtering info:
- removed most abbreviations that could be misspelled or read in the wrong form by the reader
- based on Scarfmonster's initial config file, I kept the replacements for centuries such as XII w.
- replaced some abbreviations with the full form when one was available
- filtered other patterns of 2-4 letters that are uncommon in Polish words, like qu, sh, łł, heim
- left x, v and q as allowed letters (open to discussion)
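The letter-pattern filter from the list above can be sketched roughly like this. It only covers the four example patterns named there (qu, sh, łł, heim); the helper name `has_uncommon_pattern` and the case-insensitive matching are assumptions, and the real rules file in the PR may express this differently.

```python
import re

# Only the example patterns mentioned above; the real list is likely longer.
UNCOMMON_PATTERNS = re.compile(r"qu|sh|łł|heim", re.IGNORECASE)


def has_uncommon_pattern(sentence):
    """Return True if the sentence contains a letter sequence that is
    uncommon in Polish words and so should be filtered out."""
    return UNCOMMON_PATTERNS.search(sentence) is not None


sentences = ["Ala ma kota.", "Mieszkał w Mannheim."]
kept = [s for s in sentences if not has_uncommon_pattern(s)]
# kept == ["Ala ma kota."]
```

Note that the common Polish digraph "sz" is not affected, since only the literal sequence "sh" is matched.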
I used the Python segmenter with Punkt, which in trials gave better results than the Rust implementation (a topic started by user Scarfmonster some time ago), and added some abbreviation tweaks found in gists (marked in the source).
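To illustrate the kind of abbreviation tweak meant here, below is a rough sketch using NLTK's Punkt classes directly. It is not the segmenter file from this PR: it assumes `nltk` is installed, it uses an untrained tokenizer rather than a trained Polish model, and the abbreviation set is a tiny hypothetical sample.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hypothetical sample of Polish abbreviations. Punkt expects them
# lowercased and without the trailing period.
params = PunktParameters()
params.abbrev_types = {"np", "tzw", "ok"}

# Passing PunktParameters instead of training text uses them as-is.
tokenizer = PunktSentenceTokenizer(params)


def split_sentences(text):
    """Split text into sentences without breaking after the
    abbreviations listed in params.abbrev_types."""
    return tokenizer.tokenize(text)


for sentence in split_sentences("Lubi owoce, np. jabłka. Nie lubi warzyw."):
    print(sentence)
```

With "np" registered as an abbreviation, the period after "np." should no longer be treated as a sentence boundary. In practice the abbreviation set would be merged into a trained Polish Punkt model instead of an empty one.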
So far, one other Polish contributor and I have reviewed it, and we estimated 96% and 94% OK sentences from a random sample of 400. I'll try to get at least one more reviewer, but I don't have high hopes.
@J-Wrobel did you have any luck getting a third reviewer?
No, I tried to get someone to look at it, but no luck despite some initial interest. Everyone seems busy :)