The probabilities of generated guesses do not match the output of password_scorer
I use the command below to randomly generate passwords for Monte Carlo estimation:
python3 pcfg_guesser.py -m random_walk
Then I use password_scorer to assign probabilities to the randomly generated passwords:
python3 password_scorer.py -i xxx
In the output of password_scorer, some of the randomly generated passwords are assigned a probability of zero. This is strange, since the guesser itself produced them.
I suspect the OMEN part of this PCFG program causes this. Looking through the password_scorer code, it does not include OMEN, but guess generation includes the structure M, which I think is for OMEN. I wrote a Monte Carlo program based on password_scorer (that is, I added random sampling to it), and the Monte Carlo results are lower than those from actual guess generation.
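To make the mismatch concrete, here is a minimal sketch of a sampling-based guess-number estimator. It assumes the common Monte Carlo approach in which each sampled password with model probability p_i contributes a weight of 1 / (n * p_i); the function names are hypothetical and none of this is taken from pcfg_guesser.py or password_scorer.py. It shows that a sample scored as zero cannot be weighted at all, so it has to be dropped, which may pull the estimate down.

```python
import bisect
from itertools import accumulate


def build_estimator(sample_probs):
    """Build a guess-number estimator from the model probabilities of n samples.

    Hypothetical helper: sample_probs holds the probability the scorer
    assigned to each randomly generated password.
    """
    n = len(sample_probs)
    # A zero-probability sample would need the weight 1 / (n * 0), which is
    # undefined, so it is dropped here. n still counts it.
    usable = sorted((p for p in sample_probs if p > 0), reverse=True)
    # cumulative[i] = sum of 1 / (n * p_j) over the i + 1 most probable samples.
    cumulative = list(accumulate(1.0 / (n * p) for p in usable))
    # Negate so the list is ascending, as bisect requires.
    negated = [-p for p in usable]

    def estimate_guess_number(prob):
        # Count the samples whose probability is strictly greater than prob.
        k = bisect.bisect_left(negated, -prob)
        return cumulative[k - 1] if k else 0.0

    return estimate_guess_number
```

Under these assumptions, every sample that the scorer wrongly assigns zero removes its weight from the sum, so the estimated guess numbers come out lower than they would if all samples were scored correctly.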
A bigger problem is multiword detection. When training on the full dataset, multiwords are identified because there is a good opportunity to learn what base words look like (multiwords are learned dynamically rather than being pre-programmed). The scorer doesn't have access to that data, so it can't assign a probability to those guesses.
For example, the multiword detector in training might find "horse" and "staple" as individual words. It could then generate a guess using the base structure A5A6 (aka a multiword), with the resulting guess 'horsestaple'. As that multiword did not exist in the original training set, the scorer doesn't know what probability to assign it. And since it doesn't have the multiword info, it can't "split" those words up again.
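A toy sketch of that failure mode, and of what reloading the base words would allow, might look like the following. The function names and the flat word-probability dictionary are hypothetical and much simpler than the real grammar: the sketch ignores base-structure probabilities and only tries two-word splits.

```python
def score_without_multiword(guess, word_probs):
    """Look the guess up as one alpha string, as a scorer without multiword data would."""
    return word_probs.get(guess, 0.0)


def score_with_multiword(guess, word_probs, base_words):
    """Try a whole-string lookup first, then a two-word split using saved base words."""
    direct = word_probs.get(guess, 0.0)
    if direct > 0:
        return direct
    for i in range(1, len(guess)):
        left, right = guess[:i], guess[i:]
        if left in base_words and right in base_words:
            # Simplification: multiply the word probabilities only; the
            # base-structure probability (e.g. for A5A6) is left out.
            return word_probs.get(left, 0.0) * word_probs.get(right, 0.0)
    return 0.0
```

With illustrative probabilities for "horse" and "staple", the first function returns zero for 'horsestaple' because the combined string was never seen in training, while the second can split it and return a non-zero score.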
I don't have a good solution to this beyond saving the multiword data to reload into the scorer. I'm hesitant to take that route with the grammars I'm providing since they are already getting a bit too large for distribution, but I could see making that an option for people using the scorer and training on their own datasets.