PDFrankenstein

Python tool for bulk malicious PDF feature extraction.

Dependencies

PyV8 (and V8) (optional: needed only if you intend to use JS deobfuscation). Note: JS deobfuscation must be run in a safe environment, as you would treat any malware.
lxml
scandir (optional: module included in lib folder)
postgresql and psycopg2 (optional: if you intend to use postgresql backing storage)
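
Before a bulk run it can help to confirm which of these modules are importable. The following is a minimal sketch, not part of PDFrankenstein itself; the import names (`PyV8`, `lxml`, `scandir`, `psycopg2`) are taken from the list above, and the helper `missing_modules` is a hypothetical name introduced only for illustration.

```python
import importlib

# Import names assumed from the dependency list above.
REQUIRED_MODULES = ["lxml"]
OPTIONAL_MODULES = {
    "PyV8": "JS deobfuscation",
    "scandir": "directory traversal (a copy is included in the lib folder)",
    "psycopg2": "postgresql backing storage",
}


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_modules(REQUIRED_MODULES):
        print("required module missing: {0}".format(name))
    for name in missing_modules(sorted(OPTIONAL_MODULES)):
        print("optional module missing: {0} ({1})".format(name, OPTIONAL_MODULES[name]))
```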

$ pdfrankenstein.py --help

Output delimited plain text to a file after parsing ALL files in pdf-dir/

$ pdfrankenstein.py -o file -n fileoutput.txt ~/pdf-dir

Output to an sqlite database

$ pdfrankenstein.py -o sqlite3 -n pdf-db ~/pdf-dir
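
The schema of the resulting database is not described here, so a reasonable first step is to list the tables it contains. This is a minimal sketch using Python's standard `sqlite3` module; the helper name `list_tables` is hypothetical, and the assumption that the database file on disk is named exactly `pdf-db` should be checked against what the tool actually writes.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    # Note: sqlite3.connect creates an empty database if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # "pdf-db" is the name passed with -n above; the on-disk file name is an assumption.
    print(list_tables("pdf-db"))
```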

Output to stdout after parsing all files listed inside file-with-pdfs

$ pdfrankenstein.py -o stdout ~/file-with-pdfs
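
A list file such as file-with-pdfs could be generated with a short script. The sketch below assumes the list holds one path per line, which is not documented here, and it selects files by a `.pdf` extension, which malicious samples may lack. The helper name `write_pdf_list` is hypothetical.

```python
import os


def write_pdf_list(root, list_path):
    """Write the path of every *.pdf file under root to list_path, one per line.

    The one-path-per-line layout is an assumption, not a documented format.
    Returns the number of paths written.
    """
    count = 0
    with open(list_path, "w") as handle:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in sorted(filenames):
                if filename.lower().endswith(".pdf"):
                    handle.write(os.path.join(dirpath, filename) + "\n")
                    count += 1
    return count


if __name__ == "__main__":
    total = write_pdf_list(os.path.expanduser("~/pdf-dir"),
                           os.path.expanduser("~/file-with-pdfs"))
    print("wrote {0} paths".format(total))
```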

pdf_in	PDF input for analysis. Can be a single PDF file, a directory of files, or (as in the stdout example) a file listing PDFs.
-d, --debug	Print debugging messages.
-o, --out	Analysis output filename or type. Defaults to an 'unnamed-out.*' file in the CWD. Options: 'sqlite3' | 'postgres' | 'stdout' | [filename]
-n, --name	Name for output database.
--hasher	Specify which type of hasher to use: PeePDF | PDFMiner (default). The PDFMiner option provides better parsing capabilities.
-v, --verbose	Verbose terminal output (TODO).
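
For readers who want to see how the options above fit together, here is a sketch of an equivalent `argparse` declaration. It is an illustration reconstructed from the option descriptions, not the tool's actual source; the function name `build_parser`, the literal default `unnamed-out`, and the exact spelling accepted by `--hasher` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdfrankenstein.py")
    parser.add_argument("pdf_in",
                        help="PDF input: a single PDF file or a directory of files")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="print debugging messages")
    # Assumed default; the documented default is an 'unnamed-out.*' file in the CWD.
    parser.add_argument("-o", "--out", default="unnamed-out",
                        help="'sqlite3', 'postgres', 'stdout', or a filename")
    parser.add_argument("-n", "--name",
                        help="name for output database")
    # Accepted spellings are assumed from the option description.
    parser.add_argument("--hasher", default="PDFMiner",
                        choices=["PeePDF", "PDFMiner"],
                        help="which type of hasher to use")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="verbose terminal output")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Parsing `["-o", "sqlite3", "-n", "pdf-db", "~/pdf-dir"]` with this parser yields `out="sqlite3"`, `name="pdf-db"`, and `pdf_in="~/pdf-dir"`, matching the sqlite example invocation.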