langdata/pol/pol.wordlist duplicated entries

Open neube3 opened this issue 8 years ago • 0 comments

Hello!

There is a great list of Polish words at https://raw.githubusercontent.com/tesseract-ocr/langdata/master/pol/pol.wordlist (05ec588 on 25 Jun 2015).

Somehow, though, even though I am Polish and use a Polish keyboard and Polish Windows 8.1 with Polish fonts, I see "ﬆ" (or "?" in notepad2) in many lines - this is not a character you encounter at all in the Polish language.

After a few seconds, I realised that every line with the mysterious "ﬆ" sign follows a line containing the "st" bigram, for which the "ﬆ" sign substitutes.

Polish has no single-character "st" digraph (there are no strict digraphs in Polish at all - they are always written as two separate letters: "rz" is just "r" and "z", "ch" is just "c" and "h", and the same goes for "sz", "cz", "dz", "dż" and "dź").

So, basically, the list has a duplicated entry for every word containing the "st" bigram. There are currently 658822 full lines + 1 newline in the raw file; after I ran a quick regexp in notepad2 to remove the duplicates, I ended up with 608933 full lines + 1 newline - 49889 fewer lines, a reduction of about 8% in line count.
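
For anyone who wants to reproduce the pruning without notepad2, here is a minimal Python sketch of the same idea. It is not the regexp I actually used. It assumes the mysterious sign is the single-character "st" ligature U+FB06 and that each duplicate sits directly after its plain-"st" original; the function name `prune_ligature_duplicates` and the sample words are made up for illustration.

```python
# Assumption: the odd character is the "st" ligature, U+FB06.
LIGATURE = "\ufb06"


def prune_ligature_duplicates(lines):
    """Drop every line that only repeats the previously kept line
    with "st" written as the ligature character."""
    kept = []
    for line in lines:
        word = line.rstrip("\n")
        is_duplicate = (
            LIGATURE in word
            and len(kept) > 0
            and word.replace(LIGATURE, "st") == kept[-1]
        )
        if not is_duplicate:
            kept.append(word)
    return kept


# Hypothetical sample, not taken from the real wordlist.
sample = ["miasto", "mia\ufb06o", "dom", "stary", "\ufb06ary"]
print(prune_ligature_duplicates(sample))  # ['miasto', 'dom', 'stary']
```

A ligature line that does not directly follow its plain form is kept on purpose, so nothing is lost if my assumption about the ordering turns out to be wrong somewhere in the file.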

Now, if there is a legitimate reason for the duplicates with non-existent characters (maybe it's easier to OCR with such redundancy? I don't think I know the topic well enough to even guess), then great, this issue is moot and invalid. But if there is no such reason, then the Polish wordlist can be automatically pruned.

neube3, Apr 20 '16 11:04