tesstrain
Create Character Count from training text
Character frequency report.
Source: https://github.com/cmroughan/kraken_generated-data/blob/master/tools/count_chars.py
USAGE: count_chars.py <txt_file> | sort -n -r > <txt_file>.charcount
This pull request introduces 1 alert when merging cec80b73976e0e589d1b1a32491c0f060621342f into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com
new alerts:
- 1 for Except block handles 'BaseException'
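For context, this alert means the flagged script catches BaseException (a bare except: has the same effect), which also swallows KeyboardInterrupt and SystemExit. A minimal sketch of the usual remedy, assuming the except block guards a codepoint-name lookup (the flagged file's exact code is not reproduced here):

import unicodedata

def safe_name(char):
    # Catch only the specific exception instead of BaseException:
    # unicodedata.name() raises ValueError for codepoints without a name.
    try:
        return unicodedata.name(char)
    except ValueError:
        return "UNKNOWN"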
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Fixes #221 IINM
@bertsky I copied the script from another repo and added an external sort as a quick hack to get the char count. Please feel free to modify as needed.
Please provide a description of the input data and the needed output, so those of us who did not look at the file can write a new script under the Apache license.
We cannot use GPL code in tesstrain, which is Apache-licensed.
I don't think this short script meets the threshold of originality, though. After all, it just counts characters in files with Python. My above suggestions already changed most of the script's lines to make it more useful and makefile-workable. Just apply them and strip the Kraken reference. (I am not contesting Kraken's originality, only that single file's.)
Also, it would help to offer a rule for the makefile already.
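A minimal sketch of what such a rule could look like (the charfreq.txt output name and the data/$(MODEL_NAME)/all-gt path are assumptions for illustration, not taken from tesstrain's actual Makefile; the recipe line must be indented with a tab):

data/$(MODEL_NAME)/charfreq.txt: data/$(MODEL_NAME)/all-gt
	python3 count_chars.py $< > $@

Here $< is the aggregated ground-truth text and $@ the generated report file.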
Besides, in 314e799 I proposed similar functionality using only shell means, i.e. grep -o . | sort | uniq -c | sort -rn. It does not show codepoint names via unicodedata.name, but apart from that it should produce the same result.
Or abandon this PR and just merge #260.
Here is my Python solution without the need for extra tools. I did not implement reading from stdin, as I do not see a use for it in "make training".
Usage: python3 count_chars.py data/foo/all-gt
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
import unicodedata
from collections import Counter, OrderedDict


def show_char_frequency(string):
    # Drop spaces, count the remaining characters and print them as
    # tab-separated lines ordered by descending frequency.
    string = string.replace(" ", "")
    dic = OrderedDict(Counter(string).most_common())
    for char in dic:
        # unicodedata.name() raises ValueError for codepoints without a
        # name, so fall back to a placeholder instead of crashing.
        name = unicodedata.name(char, "UNKNOWN")
        print(f"{char}\t{dic[char]}\t{name}")


def read_file(filename):
    # Read the ground-truth file and join its lines into one string.
    with open(filename, encoding="utf-8", mode="rt") as fd:
        text_lines = fd.read().strip().split("\n")
    return " ".join(text_lines)


def main():
    if len(sys.argv) < 2:
        print(f"USAGE: {sys.argv[0]} <txt_file>")
        return 1
    filename = sys.argv[1]
    string = read_file(filename)
    show_char_frequency(string)
    return 0


if __name__ == "__main__":
    sys.exit(main())
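For illustration only (this example is not from the thread): on a ground-truth file containing just the line "aab ac", the script would print tab-separated lines of character, count, and codepoint name, ordered by descending frequency (spaces are ignored):

a	3	LATIN SMALL LETTER A
b	1	LATIN SMALL LETTER B
c	1	LATIN SMALL LETTER C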
Commit 29d394b7253c9f933a7fdf57f553d305576f9a5d merged the modifications (based on the original code) that were made in this pull request.