Support gzip input in trainer.py

Open matrix opened this issue 2 months ago • 0 comments

Hi,

With this patch it is possible to use a gzipped training file.
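
For readers who want a feel for the approach, here is a minimal sketch of opening a training file that may or may not be gzipped. It is not the actual patch: the helper name `open_training_file` is hypothetical, and detecting gzip by its two magic bytes (rather than by the `.gz` extension) is an assumption made for illustration.

```python
import gzip

# The first two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def open_training_file(path, encoding="utf-8"):
    """Open a training file for text reading, handling gzip transparently.

    Hypothetical helper, not part of trainer.py as shipped.
    """
    # Peek at the first two bytes to decide whether the file is gzipped.
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC

    if is_gzip:
        # "rt" makes gzip.open return a text-mode file object.
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)
```

Because both branches return a text-mode file object, code that iterates over the file line by line would not need to know which kind of file it was given.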

POC

$ python3 trainer.py -t ~/works/wl3/EnciclopediaItaliana.txt.gz -r test

    ____            __  __           ______            __   
   / __ \________  / /_/ /___  __   / ____/___  ____  / /  
  / /_/ / ___/ _ \/ __/ __/ / / /  / /   / __ \/ __ \/ /   
 / ____/ /  /  __/ /_/ /_/ /_/ /  / /___/ /_/ / /_/ / /     
/_/ __/_/_  \___/\__/\__/\__, /   \__________/\____/_/     
   / ____/_  __________  __/_/_   / ____/_  _____  _____________  _____
  / /_  / / / /_  /_  / / / / /  / / __/ / / / _ \/ ___/ ___/ _ \/ ___/
 / __/ / /_/ / / /_/ /_/ /_/ /  / /_/ / /_/ /  __(__  |__  )  __/ /    
/_/____/__,_/ /___//__/\__, /   \____/\__,_/\___/____/____/\___/_/     
 /_  __/________ _(_)___ /_/_ _____        
  / / / ___/ __ `/ / __ \/ _ \/ ___/        
 / / / /  / /_/ / / / / /  __/ /            
/_/ /_/   \__,_/_/_/ /_/\___/_/       

Version: 4.7

-----------------------------------------------------------------
Attempting to autodetect file encoding of the training passwords
-----------------------------------------------------------------
File Encoding Detected: utf-8
Confidence for file encoding: 0.99
If you think another file encoding might have been used please 
manually specify the file encoding and run the training program again

-------------------------------------------------
Performing the first pass on the training passwords
What we are learning:
A) Identify words for use in multiword detection
B) Identify alphabet for Markov chains
C) Duplicate password detection, (duplicates are good!)
-------------------------------------------------

Printing out status after every million passwords parsed
------------

Number of Valid Passwords: 281923
Number of Encoding Errors Found in Training Set: 0


WARNING:
   No duplicate passwords were detected in the first 100000 parsed passwords

    This may be a problem since the training program needs to know frequency
    info such as '123456' being more common than '629811'

-------------------------------------------------
Performing the second pass on the training passwords
What we are learning:
A) Learning Markov (OMEN) NGRAMS
B) Training the core PCFG grammar
-------------------------------------------------

Printing out status after every million passwords parsed
------------

-------------------------------------------------
Calculating Markov (OMEN) probabilities and keyspace
This may take a few minutes
-------------------------------------------------

OMEN Keyspace for Level : 1 : 120
OMEN Keyspace for Level : 2 : 1372
OMEN Keyspace for Level : 3 : 8760
OMEN Keyspace for Level : 4 : 42617
OMEN Keyspace for Level : 5 : 179047
OMEN Keyspace for Level : 6 : 653564
OMEN Keyspace for Level : 7 : 2109901
OMEN Keyspace for Level : 8 : 6062042
OMEN Keyspace for Level : 9 : 16246605
OMEN Keyspace for Level : 10 : 41594143
OMEN Keyspace for Level : 11 : 103770350
OMEN Keyspace for Level : 12 : 260785409
OMEN Keyspace for Level : 13 : 711119473
OMEN Keyspace for Level : 14 : 2486041579

-------------------------------------------------
Performing third pass on the training passwords
What we are learning:
A) What Markov (OMEN) probabilities the training passwords would be created at
-------------------------------------------------


-------------------------------------------------
Top 5 e-mail providers
-------------------------------------------------


-------------------------------------------------
Top 5 URL domains
-------------------------------------------------

;7"“'i/"lg.no : 1
ciicnrbit.it : 1
crora.no : 1
crsmologia.ca : 1
(ctir.ru : 1

-------------------------------------------------
Top 10 Years found
-------------------------------------------------

1900 : 1
1908 : 1
1911 : 1
1916 : 1
1943 : 1
1947 : 1
1969 : 1
1970 : 1
1974 : 1
1975 : 1

-------------------------------------------------
Saving Data
-------------------------------------------------

PW Length 1 : (10, 0)
PW Length 2 : (11, 0)
PW Length 3 : (12, 0)
PW Length 4 : (5, 15635)
PW Length 5 : (6, 26304)
PW Length 6 : (7, 33386)
PW Length 7 : (7, 38508)
PW Length 8 : (8, 39048)
PW Length 9 : (10, 34992)
PW Length 10 : (11, 28946)
PW Length 11 : (12, 21190)
PW Length 12 : (13, 13692)
PW Length 13 : (15, 8181)
PW Length 14 : (17, 4731)
PW Length 15 : (18, 2642)
PW Length 16 : (20, 1473)
PW Length 17 : (21, 730)
PW Length 18 : (23, 439)
PW Length 19 : (25, 233)
PW Length 20 : (26, 165)
PW Length 21 : (27, 133)
$ ../compiled-pcfg-matrix/pcfg_guesser -r Rules/test
    ____            __  __           ______            __              
   / __ \________  / /_/ /___  __   / ____/___  ____  / /              
  / /_/ / ___/ _ \/ __/ __/ / / /  / /   / __ \/ __ \/ /               
 / ____/ /  /  __/ /_/ /_/ /_/ /  / /___/ /_/ / /_/ / /                
/_/ __/_/_  \___/\__/\__/\__, /   \__________/\____/_/                 
   / ____/_  __________  __/_/_   / ____/_  _____  _____________  _____
  / /_  / / / /_  /_  / / / / /  / / __/ / / / _ \/ ___/ ___/ _ \/ ___/
 / __/ / /_/ / / /_/ /_/ /_/ /  / /_/ / /_/ /  __(__  |__  )  __/ /    
/_/    /__,_/ /___//__/\__, /   \____/\__,_/\___/____/____/\___/_/     
                         /_/                                           
 ---------------------------> PURE C EDITION!!!                       
Version: 4.1
Loading Ruleset:Rules/test/
Initailizing the Priority Queue
Starting to generate guesses
dell
gt
mente
lt
zione
gt,
all
vano
nell
quest
dell,
gt.
dall
zioni
rono
lt,
mento
de
dell.
menti
deu
acqua
gt-
altra
lt.
all,
altro
amp
opera
mente,
deir
coll
azione
l,
Dell
de,
quell
zione,
dell-

matrix • Nov 02 '25 13:11