add a quick way of tokenizing by character

Open philipp opened this issue 6 years ago • 2 comments

Because the "tokenize" parameter is tested for existence, it's challenging to tokenize on "nothing" (which would split everything into individual characters).

Notably, the Python and Perl implementations also behave differently: distribution.py will successfully split on "0", while the Perl version, given "-t=0", acts as though I hadn't passed a tokenize parameter at all.

The Perl-with-zero behavior should be easy to fix, but I'd suggest adding another special "tokenize" value (along with the existing "white" and "word") of "char" or something similar.

I'm not very experienced with Python. In Perl you can simply add a line like elsif ($tokenize eq 'char') { $tokenize = ''; } but as far as I can tell, Python will not behave that way when splitting on an empty regex. It's also beyond me how to properly test for "None" versus some other notion of existence, to see whether the option was defined at all on the command line.
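For illustration, here is a minimal Python sketch of both points: checking whether the option was supplied at all, and splitting into characters without relying on an empty regex. The helper name tokenize_line and the handling of "white" and "word" are assumptions made for this example, not the actual distribution.py code. Because the behavior of re.split with an empty pattern has varied between Python versions, the sketch sidesteps it and uses list() instead.

```python
import re


def tokenize_line(line, tokenize=None):
    """Hypothetical helper; not taken from distribution.py."""
    line = line.rstrip("\n")
    # "is None" distinguishes "option not given" from values such as "0" or "",
    # which a plain truthiness test could confuse in other languages.
    if tokenize is None:
        return [line]
    # Proposed special value: one token per character. An empty string is
    # treated the same way, so no empty regex is ever compiled.
    if tokenize in ("char", ""):
        return list(line)
    # Assumed meanings of the existing special values.
    if tokenize == "white":
        return line.split()
    if tokenize == "word":
        return re.findall(r"\w+", line)
    # Anything else is treated as a regex delimiter; drop empty pieces.
    return [tok for tok in re.split(tokenize, line) if tok]


if __name__ == "__main__":
    print(tokenize_line("Hello\n", "char"))
    print(tokenize_line("a0b0c", "0"))
```

The key design choice is comparing against None explicitly, so that "0" remains a valid delimiter rather than being read as "no value".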

Anyway, there's always a workaround for now: split the entire input before it even gets in, e.g.

cat theFile | perl -ne 'print join "\n", split //' | distribution

But it feels like something that should be available more easily.

philipp, Sep 12 '17 20:09

I'm not sure I understand what's being asked for here. Are you looking to get a distribution of every character in a file?

wizzat, Sep 13 '17 22:09

From the python2 documentation for re:

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

Does

sh-3.2$ echo "Hello, World" | distribution --rcfile=/dev/null --tokenize='(.)'
Key|Ct (Pct)    Histogram
l|3 (25.00%) -------------------------------------------------------------------
o|2 (16.67%) ---------------------------------------------
r|1  (8.33%) -----------------------
e|1  (8.33%) -----------------------
d|1  (8.33%) -----------------------
W|1  (8.33%) -----------------------
H|1  (8.33%) -----------------------
,|1  (8.33%) -----------------------
 |1  (8.33%) -----------------------

accomplish what you are after?

bradfordboyle, Sep 14 '17 07:09
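The capturing-group trick in the last comment can be reproduced directly with Python's re module. The sketch below is only an illustration of why --tokenize='(.)' yields one token per character; the function name char_counts is made up for this example, and discarding empty pieces is an assumption about how the tool treats them.

```python
import re
from collections import Counter


def char_counts(text, pattern=r"(.)"):
    """Hypothetical helper: count tokens produced by a capturing split."""
    # With a capturing group, re.split returns the matched text as well as
    # the (here empty) pieces between matches:
    # re.split(r"(.)", "ab") -> ['', 'a', '', 'b', '']
    pieces = re.split(pattern, text)
    # Discard the empty pieces so that only the captured characters remain.
    return Counter(piece for piece in pieces if piece)


if __name__ == "__main__":
    counts = char_counts("Hello, World")
    for key, n in counts.most_common():
        print(f"{key}|{n}")
```

For "Hello, World" this gives 12 tokens in total, with "l" counted three times and "o" twice, which agrees with the percentages in the histogram above.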