tesstrain Create Character Count from training text

Character frequency report Source: https://github.com/cmroughan/kraken_generated-data/blob/master/tools/count_chars.py

USAGE: count_chars.py <txt_file> | sort -n -r > <txt_file>.charcount

Mar 16 '21 16:03 Shreeshrii

This pull request introduces 1 alert when merging cec80b73976e0e589d1b1a32491c0f060621342f into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com

new alerts:

1 for Except block handles 'BaseException'
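
For context, this class of alert typically flags a bare `except:` clause, which also intercepts `KeyboardInterrupt` and `SystemExit`. The flagged code is not shown in this thread, so the following is only a hypothetical sketch of the pattern and a narrower alternative. The function names and the `<unnamed>` placeholder are assumptions for illustration.

```python
import unicodedata


def char_name_broad(char):
    """Pattern of the kind the alert describes: a bare except."""
    try:
        return unicodedata.name(char)
    except:  # noqa: E722 - a bare except also catches KeyboardInterrupt/SystemExit
        return "<unnamed>"


def char_name_narrow(char):
    """Narrower alternative: catch only the error that is expected here."""
    try:
        return unicodedata.name(char)
    except ValueError:  # raised when the character has no Unicode name
        return "<unnamed>"
```

Both functions behave the same for ordinary input. The difference only shows when the user interrupts the program or `sys.exit()` is called inside the `try` block.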

Mar 16 '21 16:03 lgtm-com[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jun 02 '21 16:06 stale[bot]

Fixes #221 IINM

Jun 03 '21 12:06 bertsky

@bertsky I copied the script from another repo and added an external sort as a quick hack to get the char count. Please feel free to modify as needed.

Jul 16 '21 01:07 Shreeshrii

Please provide a description of the input data and the required output, so that those of us who have not looked at the file can write a new script under the Apache license.

Sep 04 '21 07:09 zdenop

We cannot use GPL code in tesstrain with Apache license.

I don't think this short script meets the threshold of originality, though. After all, it just counts characters of files in Python. My above suggestions already changed most of the script's lines to make it more useful and makefile-workable. Just apply them and strip the Kraken reference. (I am not contesting Kraken's originality, only that single file's.)

Sep 06 '21 10:09 bertsky

Also, it would help to offer a rule for the makefile already.

Besides, in 314e799 I proposed similar functionality using only shell means (i.e. grep -o . | sort | uniq -c | sort -rn). It does not show codepoint names via unicodedata.name, but apart from that it should be the same.

Or abandon this PR and just merge #260.
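
For comparison, the shell pipeline mentioned above (grep -o . | sort | uniq -c | sort -rn) can be approximated in Python as follows. This is a minimal sketch and not the code from 314e799. The names `char_counts` and `format_report` are hypothetical, and the tie-breaking order and column width are assumptions that may differ from what `sort` and `uniq -c` print.

```python
from collections import Counter


def char_counts(text):
    """Return (char, count) pairs, most frequent first, ignoring newlines."""
    counts = Counter(ch for ch in text if ch != "\n")
    # Sort by descending count, then by character, so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def format_report(text):
    """Format one 'count char' line per distinct character."""
    return "\n".join(f"{count:7d} {char}" for char, count in char_counts(text))
```

Calling `print(format_report(open("data/foo/all-gt", encoding="utf-8").read()))` would print the report for the file used in the usage example of this thread.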

Sep 06 '21 10:09 bertsky

Here is my Python solution, which needs no extra tools. I did not implement reading from stdin, as I do not see a use for it in "make training".

Usage: python3 count_chars.py data/foo/all-gt

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


import sys
import unicodedata
from collections import Counter, OrderedDict


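def show_char_frequency(string):
    # Spaces are not counted; characters are listed most frequent first.
    string = string.replace(" ", "")
    dic = OrderedDict(Counter(string).most_common())
    for char in dic:
        # unicodedata.name() raises ValueError for characters without a name
        # (e.g. tab or other control characters), so pass a fallback value.
        name = unicodedata.name(char, "<unnamed>")
        print(f"{char}\t{dic[char]}\t{name}")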


def read_file(filename):
    with open(filename, encoding="utf-8", mode="rt") as fd:
        text_lines = fd.read().strip().split("\n")
    return " ".join(text_lines)


def main():
    if len(sys.argv) < 2:
        print(f"USAGE: {sys.argv[0]} <txt_file>")
        return 1

    filename = sys.argv[1]
    string = read_file(filename)
    show_char_frequency(string)
    return 0


if __name__ == "__main__":
    # Pass main()'s return value to the shell as the exit status.
    sys.exit(main())
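
If reading from stdin is wanted after all, one possible variant is sketched below. It is not part of the script above. The helper names `read_text` and `frequency_lines`, the `-` convention for stdin, and the choice to skip all whitespace (the script above only drops spaces and line breaks) are assumptions for illustration.

```python
import sys
import unicodedata
from collections import Counter


def read_text(path=None, stream=None):
    """Read UTF-8 text from path, or from stream (default: sys.stdin)
    when path is None or '-'."""
    if path in (None, "-"):
        return (stream if stream is not None else sys.stdin).read()
    with open(path, encoding="utf-8") as fd:
        return fd.read()


def frequency_lines(text):
    """Return 'char<TAB>count<TAB>name' lines, most frequent first."""
    # Assumption: all whitespace is skipped, not only spaces.
    counts = Counter(ch for ch in text if not ch.isspace())
    return [
        f"{ch}\t{n}\t{unicodedata.name(ch, '<unnamed>')}"
        for ch, n in counts.most_common()
    ]
```

A caller could then run `print("\n".join(frequency_lines(read_text(sys.argv[1] if len(sys.argv) > 1 else None))))`, so that the script works both with a file name and at the end of a pipe.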

Jan 09 '23 18:01 zdenop

Commit 29d394b7253c9f933a7fdf57f553d305576f9a5d merged the modifications (based on the original code) which were made in this pull request.

Mar 09 '24 07:03 stweil
