pdfrankenstein
Python tool for bulk PDF feature extraction. This tool is a prototype.
PDFrankenstein
Python tool for bulk malicious PDF feature extraction.
Dependencies
- PyV8 (and V8) (optional: only needed if you intend to use JS deobfuscation). Note: JS deobfuscation must be run in a safe environment, with the same precautions you would take for any malware.
- lxml
- scandir (optional: module included in lib folder)
- postgresql and psycopg2 (optional: if you intend to use postgresql backing storage)
Usage
$ pdfrankenstein.py --help
Output to a file in delimited plain text, parsing ALL files in pdf-dir/
$ pdfrankenstein.py -o file -n fileoutput.txt ~/pdf-dir
Output to an sqlite database
$ pdfrankenstein.py -o sqlite3 -n pdf-db ~/pdf-dir
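Once a run has produced an SQLite database, Python's standard `sqlite3` module can be used to see what was written. The sketch below makes no assumption about the schema PDFrankenstein creates: it only lists whatever tables exist and previews a few rows. The helper names `list_tables` and `preview_rows` are hypothetical, and the exact file name that `-n pdf-db` produces on disk is not documented here, so pass whichever path the tool actually created.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def preview_rows(db_path, table, limit=5):
    """Return (column_names, rows) for the first `limit` rows of `table`."""
    # Only accept table names that really exist, since the name is
    # interpolated into the SQL text below.
    if table not in list_tables(db_path):
        raise ValueError("no such table: {!r}".format(table))
    quoted = '"{}"'.format(table.replace('"', '""'))
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute("SELECT * FROM {} LIMIT ?".format(quoted), (limit,))
        columns = [d[0] for d in cur.description]
        return columns, cur.fetchall()
    finally:
        conn.close()
```

Calling `list_tables(path)` first and then `preview_rows(path, name)` on each returned name gives a quick overview of the extracted features without hard-coding any table or column names.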
Output to stdout after parsing all files listed inside file-with-pdfs
$ pdfrankenstein.py -o stdout ~/file-with-pdfs
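A list file like `file-with-pdfs` can be generated with a few lines of Python. This is a minimal sketch that assumes the list format is one path per line, which is not confirmed here; adjust it if the tool expects something else. The function name `write_pdf_list` is hypothetical.

```python
import os


def write_pdf_list(pdf_dir, list_path):
    """Write the path of every .pdf file under pdf_dir to list_path.

    Assumption: one absolute path per line. Returns the number of paths written.
    """
    paths = []
    for root, _dirs, files in os.walk(pdf_dir):
        for fname in files:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if fname.lower().endswith(".pdf"):
                paths.append(os.path.abspath(os.path.join(root, fname)))
    paths.sort()
    with open(list_path, "w") as fh:
        for path in paths:
            fh.write(path + "\n")
    return len(paths)
```

For example, `write_pdf_list(os.path.expanduser("~/pdf-dir"), "file-with-pdfs")` would collect every PDF below `~/pdf-dir` into a file that can then be passed as the input argument.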
| Argument | Description |
| --- | --- |
| pdf_in | PDF input for analysis. Can be a single PDF file or a directory of files. |
| -d, --debug | Print debugging messages. |
| -o, --out | Analysis output filename or type. Defaults to an 'unnamed-out.*' file in the CWD. Options: 'sqlite3', 'postgres', 'stdout', or a filename. |
| -n, --name | Name for output database. |
| --hasher | Which type of hasher to use: PeePDF or PDFMiner (default). The PDFMiner option provides better parsing capabilities. |
| -v, --verbose | Spam the terminal (TODO). |
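When driving the tool from another script, the options above can be assembled into an argument list and handed to `subprocess`. The following is a rough sketch, not part of PDFrankenstein: `build_command` and `run_pdfrankenstein` are hypothetical helpers, and the code assumes `pdfrankenstein.py` is executable and on the `PATH`, as in the usage examples. Only the long flags documented above are used, and the `hasher` value is passed through unchanged because the exact accepted spelling is not stated here.

```python
import os
import subprocess


def build_command(pdf_in, out=None, name=None, hasher=None, debug=False,
                  script="pdfrankenstein.py"):
    """Build the argument list for one pdfrankenstein.py invocation."""
    cmd = [script]
    if debug:
        cmd.append("--debug")
    if out is not None:
        cmd += ["--out", out]        # 'sqlite3', 'postgres', 'stdout', or a filename
    if name is not None:
        cmd += ["--name", name]      # name for the output database
    if hasher is not None:
        cmd += ["--hasher", hasher]  # passed through as given
    # No shell is involved, so expand a leading "~" ourselves.
    cmd.append(os.path.expanduser(pdf_in))
    return cmd


def run_pdfrankenstein(pdf_in, **options):
    """Run the tool and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_command(pdf_in, **options), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names, which is worth the small extra effort when the inputs are untrusted samples.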