
new tool: analyze and split strings on character-class changes

Open roycewilliams opened this issue 5 years ago • 5 comments

As an aid to extracting likely base words, it would be very useful to split strings on character class. I've been calling the resulting strings 'tokens', but I think there's probably a better word. :)

An optional flag to consider changes in case to be significant could be useful.

For example, this list:

Hello123 PaSsWoRd$ hashes4evar

... might produce the following output, if case were treated as a character-class change:

Hello 123 Pa Ss Wo Rd $ hashes 4 evar

... and might produce this output, if case were not treated as a character-class change:

Hello 123 PaSsWoRd $ hashes 4 evar
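The two outputs above can be reproduced with a small sketch like the following. This is not how pack2 implements it; `CharClass`, `classify` and `split_on_class` are hypothetical names, and the rule that an upper-case letter followed by lower-case letters stays in one token (so that `Hello` and `Pa` are kept whole) is an assumption inferred from the example output.

```rust
/// Character classes used to decide where a string is split.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum CharClass {
    Lower,
    Upper,
    Digit,
    Other,
}

/// Classify one character. When `case_sensitive` is false, upper-case
/// letters are folded into the `Lower` class, so case never causes a split.
fn classify(c: char, case_sensitive: bool) -> CharClass {
    match c {
        'a'..='z' => CharClass::Lower,
        'A'..='Z' if case_sensitive => CharClass::Upper,
        'A'..='Z' => CharClass::Lower,
        '0'..='9' => CharClass::Digit,
        // Everything else, including non-ASCII, shares one class (assumption).
        _ => CharClass::Other,
    }
}

/// Split `s` wherever the character class changes.
/// Assumption: an Upper -> Lower transition is not a boundary, so a
/// capitalized word such as "Hello" stays in one token.
fn split_on_class(s: &str, case_sensitive: bool) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut start = 0;
    let mut prev: Option<CharClass> = None;
    for (i, c) in s.char_indices() {
        let cur = classify(c, case_sensitive);
        if let Some(p) = prev {
            let keep_together =
                p == cur || (p == CharClass::Upper && cur == CharClass::Lower);
            if !keep_together {
                tokens.push(&s[start..i]);
                start = i;
            }
        }
        prev = Some(cur);
    }
    // Flush the final token, even if it is a single character.
    if start < s.len() {
        tokens.push(&s[start..]);
    }
    tokens
}
```

With this sketch, `split_on_class("PaSsWoRd$", true)` yields `Pa`, `Ss`, `Wo`, `Rd`, `$`, while `split_on_class("PaSsWoRd$", false)` yields `PaSsWoRd`, `$`.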

I'd argue that optionally normalizing the strings on the fly would also be useful, such that it might produce this output. This somewhat artificially inflates the significance of the lower-case version of the word, but since the lower-case form is likely the most "basic" / "proto" version of a given base word, it could be argued that this is a feature, not a bug :)

Hello hello 123 Pa Ss Wo Rd PaSsWoRd password $ hashes 4 evar

Since a common use case for this is to obtain frequency counts, an optional flag to automatically also accumulate frequency count at the same time would be ideal (but also preserving the ability to not do this, to support larger data sets, would be good).
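As a rough illustration of the optional counting step, here is a minimal sketch that keeps all distinct tokens in memory, which is exactly the cost that makes the flag worth keeping optional for large data sets. `count_tokens` is a hypothetical name, not part of pack2, and the tie-breaking order is an arbitrary choice.

```rust
use std::collections::HashMap;

/// Count how often each token occurs and return the result sorted by
/// descending count, with ties broken by the token's byte order.
fn count_tokens<'a, I>(tokens: I) -> Vec<(&'a str, u64)>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<&'a str, u64> = HashMap::new();
    for token in tokens {
        *counts.entry(token).or_insert(0) += 1;
    }
    let mut sorted: Vec<(&'a str, u64)> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    sorted
}
```

Feeding it `["hello", "123", "hello", "$"]` gives `hello` with a count of 2, followed by `$` and `123` with a count of 1 each.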

Either way, doing this efficiently (in terms of both memory and speed) would be highly useful.

How to handle the long tail of non-ASCII / non-Unicode strings is up for discussion.

roycewilliams avatar May 12 '20 16:05 roycewilliams

An initial version of this has been implemented in 4d1b413. Normalization is not implemented yet. The new command is called cgrams, as it's somewhat similar to n-grams. As per the design principles, everything outside 0x20 - 0x7e is encoded in the $HEX[] format. A format like the one described here: https://github.com/Cynosureprime/chasm/issues/2#issuecomment-396687626 would also be a good idea, as it would allow a per-position-aware PRINCE-style attack.
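For readers unfamiliar with the $HEX[] convention, the encoding rule mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the code from 4d1b413: `encode_token` is a hypothetical name, and the sketch only applies the byte-range rule from this comment (any byte outside 0x20 - 0x7e triggers hex encoding of the whole token, using lower-case hex digits).

```rust
/// Return the token unchanged if every byte is printable ASCII
/// (0x20..=0x7e); otherwise return it as `$HEX[...]` with each byte
/// written as two lower-case hex digits.
fn encode_token(bytes: &[u8]) -> String {
    let printable = bytes.iter().all(|b| (0x20..=0x7e).contains(b));
    if printable {
        // Every byte is ASCII here, so a byte-to-char cast is lossless.
        bytes.iter().map(|&b| b as char).collect()
    } else {
        let mut out = String::with_capacity(6 + bytes.len() * 2);
        out.push_str("$HEX[");
        for b in bytes {
            out.push_str(&format!("{:02x}", b));
        }
        out.push(']');
        out
    }
}
```

For example, `encode_token(b"hashes")` stays `hashes`, while the UTF-8 bytes of `café` become `$HEX[636166c3a9]`.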

hops avatar May 24 '20 19:05 hops

Excellent! Very fast in my initial testing. Looking forward to normalization; looking at the output, it's clear that dropping to lower case after treating upper-case as a split boundary will yield better base words, as I'm seeing lots of these:

ollyrancher olleygirl ogshit oejam odoubt ockyboy occergal

Once normalization is possible, the frequency count of these will merge upward into their lower-case equivalents.

Nice!

roycewilliams avatar May 24 '20 21:05 roycewilliams

I've added --ignore-case and --normalize to the cgrams command. With your example words it will produce the following output:

$ pack2 cgrams -i /tmp/test
Hello
123
PaSsWoRd
hashes
4
evar
$ pack2 cgrams -i -n /tmp/test
Hello
hello
123
PaSsWoRd
password
hashes
4
evar

Note: -n can only be used together with -i.

Edit: Hmm, for whatever reason it omits the $. So there's a bug somewhere.

hops avatar Jun 13 '20 00:06 hops

The bug, which omitted the last character when it formed a token on its own right after a charset boundary change, was introduced with 1eb16fb and is fixed in 145fff7. What's left to do is the format for a position-aware PRINCE-style attack.

hops avatar Jun 13 '20 12:06 hops

Not sure if related, but fresh install on a new system produces:

   Compiling structopt-derive v0.4.7
   Compiling structopt v0.3.14
   Compiling pack2 v0.1.0 (/usr/local/src/sec/crack/pack2)
error[E0599]: no associated item named `MAX` found for type `usize` in the current scope
  --> src/statsgen.rs:39:45
   |
39 |     let mut min_len:         usize = usize::MAX;
   |                                             ^^^ associated item not found in `usize`
   |
help: you are looking for the module in `std`, not the primitive type
   |
39 |     let mut min_len:         usize = std::usize::MAX;
   |                                      ^^^^^^^^^^^^^^^

error[E0599]: no associated item named `MAX` found for type `u16` in the current scope
  --> src/cgrams.rs:27:28
   |
27 |         if line_len > u16::MAX.into() || line_len ==  0 { continue; }
   |                            ^^^ associated item not found in `u16`
   |
help: you are looking for the module in `std`, not the primitive type
   |
27 |         if line_len > std::u16::MAX.into() || line_len ==  0 { continue; }
   |                       ^^^^^^^^^^^^^

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0599`.
error: could not compile `pack2`.

To learn more, run the command again with --verbose.

roycewilliams avatar Jun 13 '20 15:06 roycewilliams
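The two E0599 errors above are what a compiler older than Rust 1.43 reports, since the `usize::MAX` and `u16::MAX` associated constants were only stabilized in that release; updating the toolchain should avoid them. If an older compiler has to be supported, the module-path spelling suggested by the help text works on both old and new toolchains. A minimal sketch, where `initial_min_len` and `line_is_usable` are hypothetical helpers that only mirror the two flagged lines, not functions from pack2:

```rust
/// Mirrors src/statsgen.rs:39: start the running minimum at the largest
/// usize, using the module-level constant instead of `usize::MAX`.
fn initial_min_len() -> usize {
    std::usize::MAX
}

/// Mirrors the check at src/cgrams.rs:27: a line is skipped when it is
/// empty or longer than u16::MAX bytes.
fn line_is_usable(line_len: usize) -> bool {
    // `std::u16::MAX` is the older module-level constant; `.into()`
    // widens it from u16 to usize without loss.
    let max_len: usize = std::u16::MAX.into();
    !(line_len > max_len || line_len == 0)
}
```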