Create Character Count from training text

Open Shreeshrii opened this issue 4 years ago • 8 comments

Character frequency report. Source: https://github.com/cmroughan/kraken_generated-data/blob/master/tools/count_chars.py

USAGE: count_chars.py <txt_file> | sort -n -r > <txt_file>.charcount

Shreeshrii avatar Mar 16 '21 16:03 Shreeshrii

This pull request introduces 1 alert when merging cec80b73976e0e589d1b1a32491c0f060621342f into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com

new alerts:

  • 1 for Except block handles 'BaseException'

lgtm-com[bot] avatar Mar 16 '21 16:03 lgtm-com[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 02 '21 16:06 stale[bot]

Fixes #221, if I'm not mistaken.

bertsky avatar Jun 03 '21 12:06 bertsky

@bertsky I copied the script from another repo and added an external sort as a quick hack to get the char count. Please feel free to modify as needed.

Shreeshrii avatar Jul 16 '21 01:07 Shreeshrii

Please provide a description of the input data and the required output, so that those of us who did not look at the file can create a new script under the Apache license.

zdenop avatar Sep 04 '21 07:09 zdenop

We cannot use GPL code in tesstrain with Apache license.

I don't think this short script meets the threshold of originality, though. After all, it just counts characters of files in Python. My above suggestions already changed most of the script's lines to make it more useful and makefile-workable. Just apply them and strip the Kraken reference. (I am not contesting Kraken's originality, only that single file's.)

bertsky avatar Sep 06 '21 10:09 bertsky

Also, it would help to offer a rule for the makefile already.

Besides, in 314e799 I proposed similar functionality using only shell tools, i.e. grep -o . | sort | uniq -c | sort -rn. It does not show codepoint names via unicodedata.name, but apart from that it should behave the same.

Or abandon this PR and just merge #260.
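For comparison, here is a rough Python sketch of what the shell pipeline mentioned above computes. It is not code from either PR: the function name char_counts is made up for illustration, and the order of characters with equal counts may differ from what sort -rn produces.

```python
from collections import Counter


def char_counts(text):
    """Return (count, char) pairs, most frequent first.

    Line breaks are skipped because `grep -o .` never matches them;
    spaces are counted, as they are by the pipeline.
    """
    counts = Counter(ch for ch in text if ch != "\n")
    # Sort by descending count, then by character for a stable order.
    return sorted(((n, ch) for ch, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))


# Small demonstration on a literal string.
sample = "abba\nab c\n"
for n, ch in char_counts(sample):
    print(f"{n:7d} {ch!r}")
```

The pairs are printed with repr() so that the space character stays visible in the output.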

bertsky avatar Sep 06 '21 10:09 bertsky

Here is my Python solution, which needs no extra tools. I did not implement reading from stdin, as I do not see a use for it in "make training".

Usage: python3 count_chars.py data/foo/all-gt

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


import sys
import unicodedata
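from collections import Counter


def show_char_frequency(string):
    # Spaces are not counted.
    string = string.replace(" ", "")
    # most_common() yields (char, count) pairs sorted by descending count.
    for char, count in Counter(string).most_common():
        # unicodedata.name() raises ValueError for unnamed characters
        # (e.g. a tab), so pass a default instead.
        name = unicodedata.name(char, "<unnamed>")
        print(f"{char}\t{count}\t{name}")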


def read_file(filename):
    with open(filename, encoding="utf-8", mode="rt") as fd:
        text_lines = fd.read().strip().split("\n")
    return " ".join(text_lines)


def main():
    if len(sys.argv) < 2:
        print(f"USAGE: {sys.argv[0]} <txt_file>")
        return 1

    filename = sys.argv[1]
    string = read_file(filename)
    show_char_frequency(string)
    return 0


if __name__ == "__main__":
    sys.exit(main())  # propagate main()'s return code to the shell
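
If reading from stdin is ever wanted, one possible approach is to do the counting on any open text stream, so that sys.stdin and a regular file are handled alike. This is only a sketch and not part of the script above: the function name char_frequency_lines and the "<unnamed>" placeholder are assumptions made for illustration.

```python
import io
import unicodedata
from collections import Counter


def char_frequency_lines(stream):
    """Yield 'char<TAB>count<TAB>name' lines for the text read from stream."""
    text = stream.read()
    # Drop spaces and line breaks, as the script above does.
    counts = Counter(ch for ch in text if ch not in " \n")
    for char, count in counts.most_common():
        # Characters without a Unicode name get a placeholder.
        name = unicodedata.name(char, "<unnamed>")
        yield f"{char}\t{count}\t{name}"


# Demonstration with an in-memory stream; sys.stdin could be passed instead.
demo = io.StringIO("aab\nb a\n")
print("\n".join(char_frequency_lines(demo)))
```

Passing sys.stdin as the stream would cover the piped case, while open(filename, encoding="utf-8") would cover the file case.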

zdenop avatar Jan 09 '23 18:01 zdenop

Commit 29d394b7253c9f933a7fdf57f553d305576f9a5d merged the modifications (based on the original code) which were made in this pull request.

stweil avatar Mar 09 '24 07:03 stweil