
new tool: analyze and split strings on character-class changes

Open roycewilliams opened this issue 4 years ago • 5 comments

As an aid to extracting likely base words, it would be very useful to split strings on character class. I've been calling the resulting strings 'tokens', but I think there's probably a better word. :)

An optional flag to consider changes in case to be significant could be useful.

For example, this list:

Hello123 PaSsWoRd$ hashes4evar

... might produce the following output, if case were treated as a character-class change:

Hello 123 Pa Ss Wo Rd $ hashes 4 evar

... and might produce this output, if case were not treated as a character-class change:

Hello 123 PaSsWoRd $ hashes 4 evar
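To make the two behaviors above concrete, here is a minimal Python sketch (not pack2 code). The names `char_class` and `split_on_class` and the `split_case` flag are hypothetical. It assumes that, with the case flag on, only a lower-to-upper transition counts as a boundary, which is what the `Hello` / `Pa Ss Wo Rd` example implies, and it lumps everything that is neither a letter nor a digit into a single "other" class.

```python
def char_class(ch):
    """Coarse character class: letters, digits, or everything else (assumed classes)."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def split_on_class(s, split_case=False):
    """Split s wherever the character class changes.

    With split_case=True, a lowercase letter followed by an uppercase
    letter also starts a new token (assumption inferred from the example).
    """
    tokens = []
    start = 0
    for i in range(1, len(s)):
        prev, cur = s[i - 1], s[i]
        boundary = char_class(prev) != char_class(cur)
        if split_case and prev.islower() and cur.isupper():
            boundary = True
        if boundary:
            tokens.append(s[start:i])
            start = i
    if s:
        tokens.append(s[start:])
    return tokens


for word in ["Hello123", "PaSsWoRd$", "hashes4evar"]:
    print(" ".join(split_on_class(word, split_case=True)))
```

Under this rule a string like `HELLOworld` stays in one piece even with the flag on; whether an upper-to-lower change should also split is a separate design question.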

I'd argue that optionally normalizing the strings on the fly would also be useful, such that it might produce this output. This somewhat artificially inflates the significance of the lower-case version of the word, but since the lower-case form is likely the most "basic" / "proto" version of a given base word, it could be argued that this is a feature, not a bug :)

Hello hello 123 Pa Ss Wo Rd PaSsWoRd password $ hashes 4 evar
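One possible reading of that output, sketched in Python: for each run of letters, emit the case-split pieces (only when there is more than one), then the run itself, then its lower-case form if that differs. The function name `expand` and both regexes are hypothetical, and the sketch is ASCII-only for brevity.

```python
import re

# Runs of ASCII letters, ASCII digits, or anything else (assumed classes).
RUN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+")
# Inside a letter run: optional capitals followed by lowercase, or capitals only.
# This breaks the run at every lower-to-upper transition.
CASE_PART = re.compile(r"[A-Z]*[a-z]+|[A-Z]+")


def expand(s):
    """Return tokens for s, adding whole-run and lower-cased variants."""
    out = []
    for run in RUN.findall(s):
        first = run[0]
        if first.isascii() and first.isalpha():
            parts = CASE_PART.findall(run)
            if len(parts) > 1:
                out.extend(parts)       # e.g. Pa Ss Wo Rd
            out.append(run)             # e.g. PaSsWoRd
            if run.lower() != run:
                out.append(run.lower())  # e.g. password
        else:
            out.append(run)
    return out


for word in ["Hello123", "PaSsWoRd$", "hashes4evar"]:
    print(" ".join(expand(word)))
```

This deliberately does not lower-case the individual pieces (`pa`, `ss`, ...), since the example output above does not include them.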

Since a common use case for this is to obtain frequency counts, an optional flag to accumulate frequency counts at the same time would be ideal (while preserving the ability to skip this, to support larger data sets).
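A rough Python sketch of the two modes, assuming a simple ASCII class splitter: `stream_tokens` yields tokens one at a time so memory use stays flat, while `count_tokens` trades memory for convenience by keeping one counter entry per distinct token. Both function names are hypothetical.

```python
import re
from collections import Counter

# Runs of ASCII letters, ASCII digits, or anything else (assumed classes).
RUN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+")


def stream_tokens(lines):
    """Yield tokens lazily; nothing is accumulated, so input size does not matter."""
    for line in lines:
        yield from RUN.findall(line.rstrip("\r\n"))


def count_tokens(lines):
    """Accumulate frequencies in memory: one entry per distinct token."""
    return Counter(stream_tokens(lines))


sample = ["Hello123\n", "Hello456\n", "hashes4evar\n"]
for token, n in count_tokens(sample).most_common(3):
    print(n, token)
```

In streaming mode the output can still be counted externally, for example by piping it through `sort | uniq -c`, which keeps the tool itself small for very large inputs.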

Either way, doing this efficiently (in terms of both memory and speed) would be highly useful.

How to handle the long tail of non-ASCII / non-Unicode strings is up for discussion.
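As one option to discuss, and only as a sketch: classify characters by Unicode general category, and fall back to Latin-1 for input that is not valid UTF-8 so that every byte still maps to some character. The names `unicode_class` and `decode_lenient` are hypothetical, and the Latin-1 fallback is an assumption, not a recommendation.

```python
import unicodedata


def unicode_class(ch):
    """Map a character to a coarse class using its Unicode general category."""
    cat = unicodedata.category(ch)  # e.g. "Lu", "Ll", "Nd", "Po"
    if cat[0] == "L":
        return "alpha"
    if cat[0] == "N":
        return "digit"
    return "other"


def decode_lenient(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 (which never fails)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


word = decode_lenient(b"caf\xe9")  # not valid UTF-8, so the Latin-1 fallback is used
print(word, [unicode_class(c) for c in word])
```

One caveat: combining marks (category `M`) land in "other" here, so words written with decomposed accents would be split unless the input is normalized first.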

roycewilliams, May 12 '20 16:05