
The probability of the generated guesses is not the same as the output of password_scorer

Veteranback opened this issue 1 year ago · 3 comments

Veteranback · Jul 25 '24

I used the command below to randomly generate passwords for Monte Carlo estimation:

python3 pcfg_guesser.py -m random_walk

Then I used password_scorer to assign probabilities to the randomly generated passwords:

python3 password_scorer.py -i xxx

I found that in the output of password_scorer, some of the randomly generated passwords are assigned a probability of zero. This is strange.

Veteranback · Jul 25 '24

I personally think the OMEN part of this PCFG program causes this situation. Looking through the code of password_scorer, it does not include OMEN, but guess generation includes the structure "M", which I think is for OMEN. I wrote a Monte Carlo program based on password_scorer (i.e., adding random sampling), and the Monte Carlo results are lower than the actual generation results.
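For context, here is a minimal sketch of that kind of Monte Carlo estimator (in the style of Dell'Amico and Filippone), assuming you already have a list of probabilities sampled from the guesser's own distribution plus a target probability from password_scorer. The function and variable names below are illustrative, not part of the pcfg_cracker codebase:

def estimate_guess_number(sample_probs, target_prob):
    # Estimate the rank (guess number) of a password whose probability is
    # target_prob, given n probabilities sampled from the guesser itself.
    if target_prob <= 0.0:
        # A zero score, as reported in this issue, degenerates the estimate
        # to "never guessed", which skews the Monte Carlo curve.
        return float("inf")
    n = len(sample_probs)
    # Every sampled password more probable than the target contributes
    # 1 / (n * p_i) to the expected number of guesses tried before it.
    return sum(1.0 / (n * p) for p in sample_probs if p > target_prob)

# Toy example with made-up probabilities:
sample = [1e-3, 5e-4, 2e-5, 1e-6, 1e-6]
print(estimate_guess_number(sample, 1e-6))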

Veteranback · Jul 25 '24

A bigger problem is multiword detection. When training on the full dataset, multiwords are identified because there is a good opportunity to learn what base words look like (multiwords are learned dynamically vs. being pre-programmed). The scorer doesn't have access to that data, so it can't assign a probability to those guesses.

For example, the multiword detector in training might find "horse" and "staple" as individual words. It could then generate a guess using the base structure A5A6 (aka a multiword), with the resulting guess 'horsestaple'. Since that multiword did not exist in the original training set, the scorer doesn't know what probability to assign it. Also, since it doesn't have the multiword info, it can't "split" those words up again.
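To make the "split" problem concrete, here is a minimal sketch; the helper and the word set are hypothetical, not the actual pcfg_cracker multiword detector. The point is that the scorer can only recover the A5A6 split if it still has the learned base words:

def can_split_multiword(guess, base_words):
    # Check whether the guess can be split into two learned base words,
    # the way 'horsestaple' is built from 'horse' + 'staple' above.
    for i in range(1, len(guess)):
        if guess[:i] in base_words and guess[i:] in base_words:
            return True
    return False

learned = {"horse", "staple", "correct", "battery"}
print(can_split_multiword("horsestaple", learned))  # True: horse + staple
print(can_split_multiword("horsestaple", set()))    # False: without the learned
                                                    # base words there is no way
                                                    # to recover the split, so the
                                                    # scorer falls through to zero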

I don't have a good solution for this beyond saving the multiword data so it can be reloaded into the scorer. I'm hesitant to take that route with the grammars I'm providing, since they are already getting a bit too large for distribution, but I could see making that an option for people using the scorer and training on their own datasets.
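A minimal sketch of what that option might look like, assuming the dynamically learned base words can simply be dumped alongside the rule set; the file name and format here are hypothetical, not something pcfg_cracker writes today:

import json

def save_multiwords(base_words, path="multiwords.json"):
    # Dump the learned base words so the scorer can reload them later.
    with open(path, "w") as f:
        json.dump(sorted(base_words), f)

def load_multiwords(path="multiwords.json"):
    with open(path) as f:
        return set(json.load(f))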

lakiw · Aug 05 '24