pdfrankenstein

Python tool for bulk feature extraction from malicious PDFs. This tool is a prototype.


  • PyV8 (and V8) (optional: only needed if you intend to use JS deobfuscation). Note: JS deobfuscation must be run in a safe environment, as you would treat any malware.
  • lxml
  • scandir (optional: module included in lib folder)
  • postgresql and psycopg2 (optional: if you intend to use postgresql backing storage)


$ pdfrankenstein.py --help

Output to a delimited plain-text file, parsing ALL files in pdf-dir/

$ pdfrankenstein.py -o file -n fileoutput.txt ~/pdf-dir

Output to an sqlite database

$ pdfrankenstein.py -o sqlite3 -n pdf-db ~/pdf-dir
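
The layout of the resulting sqlite database is not described here, so a quick way to see what a run produced is to list its tables and row counts. The following is a minimal sketch using only Python's standard sqlite3 module. `list_tables` is a hypothetical helper, not part of pdfrankenstein, and the database path you pass in is an assumption (the tool may add an extension to the name given with -n).

```python
import sqlite3


def list_tables(db_path):
    """Return (table_name, row_count) pairs for every table in db_path.

    Hypothetical helper: it makes no assumption about the schema
    pdfrankenstein writes, it only reports what is there.
    """
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        result = []
        for name in names:
            # Quote the identifier so unusual table names do not break the query.
            quoted = '"{}"'.format(name.replace('"', '""'))
            count = conn.execute(
                "SELECT COUNT(*) FROM {}".format(quoted)).fetchone()[0]
            result.append((name, count))
        return result
    finally:
        conn.close()
```

Calling `list_tables("pdf-db")` after the run above would then return one pair per table. Note that sqlite3.connect creates an empty database if the path does not exist, so an empty result may simply mean the path is wrong.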

Output to stdout after parsing all files listed inside file-with-pdfs

$ pdfrankenstein.py -o stdout ~/file-with-pdfs
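
A list file like file-with-pdfs can be generated with a few lines of Python. This is only a sketch: the format of the list file is not specified here, so one path per line is an assumption, and `write_pdf_list` is a hypothetical helper rather than something shipped with pdfrankenstein.

```python
import os


def write_pdf_list(pdf_dir, list_path):
    """Write the absolute path of every *.pdf under pdf_dir to list_path.

    Assumption: the list file holds one path per line.
    Returns the number of paths written.
    """
    paths = []
    for root, _dirs, files in os.walk(pdf_dir):
        for name in files:
            # Match on extension only; the file contents are not inspected.
            if name.lower().endswith(".pdf"):
                paths.append(os.path.abspath(os.path.join(root, name)))
    paths.sort()
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)
```

For example, `write_pdf_list(os.path.expanduser("~/pdf-dir"), os.path.expanduser("~/file-with-pdfs"))` would produce the input used in the command above.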
pdf_in          PDF input for analysis. Can be a single PDF file or a directory of files.
-d, --debug     Print debugging messages.
-o, --out       Analysis output filename or type. Defaults to an 'unnamed-out.*' file in the CWD. Options: 'sqlite3' | 'postgres' | 'stdout' | [filename]
-n, --name      Name for output database.
--hasher        Specify which type of hasher to use: PeePDF or PDFMiner (default). The PDFMiner option provides better parsing capabilities.
-v, --verbose   Spam the terminal (TODO).


Open Source PDF Tools