bulk_extractor
PDF Scanner Misses Most / All Emails in PDF generated by pandoc
Hi,
I was just testing out bulk_extractor. One of my tests was to create the following text file:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
When I then run bulk_extractor and point it to the text file, I get the expected output in email.txt:
# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.0.0
# Feature-Recorder: email
# Filename: /tmp/1.txt
# Feature-File-Version: 1.1
0 [email protected] [email protected]\015\[email protected]
16 [email protected] [email protected]\015\[email protected]\015\[email protected]\015
32 [email protected] [email protected]\015\[email protected]\015\[email protected]
47 [email protected] \[email protected]\015\[email protected]\015\[email protected]
63 [email protected] [email protected]\015\[email protected]\015\012
I then converted the text file to a PDF using pandoc (pdflatex) and opened it in a PDF reader, where I can clearly see the email addresses (on a single line with spaces between them) as shown here:
and here is the related PDF: pandoc.pdf
Now I only get this when I run bulk_extractor:
# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.0.0
# Feature-Recorder: email
# Filename: emails.pdf
# Feature-File-Version: 1.1
69-PDF-35 [email protected] e f@go ogle.com [email protected] ab c@go ogle.co
Finally, I opened the text file in Firefox, selected Print to PDF, and opened the resulting file in a PDF reader, which showed the expected text:
and here is the related PDF: firefox.pdf
However, when I run bulk_extractor on this generated PDF, email.txt is empty. Is this expected behavior? Am I missing something? Thanks
Hi. Thank you for submitting the bug report. Would it be possible for you to attach the two PDFs to this ticket?
It turns out that the bulk_extractor PDF-to-text code does not work the way most PDF-to-text programs work, because it is designed to work with fragmented files. Instead of going to the end of the PDF file, reading the cross-reference table, going to each page, creating the objects, and then interpreting the objects, scan_pdf looks for patterns within the inflated compressed streams and applies some simple heuristics. The heuristics were based on analysis of PDF files from the 2008-2014 time period, but the way PDFs are created from text changes over time. bulk_extractor was not designed for the pdflatex or Firefox PDF generators. It was designed for Microsoft Word on the Mac and Windows.
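To make the stream-scanning idea concrete, here is a minimal Python sketch. It is not scan_pdf's actual code: the function names `inflate_streams` and `naive_text`, the regular expressions, and the "join every literal string with a space" rule are all assumptions made for illustration. It scans raw bytes for `stream ... endstream` pairs, inflates whatever zlib accepts, and concatenates the literal strings it finds while ignoring their positions. A heuristic of this kind could plausibly produce fragments like `ab c@go ogle.co` when a generator splits one word across several kerned string pieces.

```python
import re
import zlib

# Assumption: stream bodies sit between "stream<EOL>" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# PDF literal strings: text in parentheses, allowing backslash escapes.
STRING_RE = re.compile(rb"\((?:\\.|[^\\()])*\)")


def inflate_streams(data: bytes):
    """Yield the inflated bytes of every stream that zlib can decompress."""
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing bytes such as the EOL
            # before "endstream", and returns partial output if truncated.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            continue  # not Flate-compressed, or too damaged to start


def naive_text(stream: bytes) -> str:
    """Join all literal strings with spaces, ignoring glyph positions."""
    parts = [m.group(0)[1:-1].decode("latin-1")
             for m in STRING_RE.finditer(stream)]
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical content stream: one word split into kerned pieces.
    content = b"BT [(ab)-20(c@go)10(ogle.com)] TJ ET"
    pdf = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
           + zlib.compress(content) + b"\nendstream\nendobj\n")
    for inflated in inflate_streams(pdf):
        print(naive_text(inflated))
```

Because nothing here depends on the cross-reference table, the same scan can run on an isolated fragment of a file, which is the property the fragment-oriented design is after. The cost is that the word boundaries come from how the generator chose to chunk its strings.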
A better heuristic would be to take all of the (x,y) locations of the text, drop them into a frame buffer, and then run OCR on the frame buffer. You wouldn't need to do full OCR because you already know what the letters are. You would need to do line and word break detection: you need to find the lines so you know the order in which to send the characters, and you need the word breaks because there are no spaces encoded in PDF files.
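The line and word break detection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Glyph` record, the `glyphs_to_text` function, and the `line_tol` and `gap_factor` thresholds are hypothetical and not part of bulk_extractor. It assumes each character has already been recovered with its left edge, baseline, and advance width. It skips the frame buffer and works directly on coordinates.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position (PDF y grows upward)
    width: float  # advance width


def glyphs_to_text(glyphs, line_tol=2.0, gap_factor=0.3):
    """Group glyphs into lines by baseline, then insert spaces at wide gaps."""
    # Line detection: glyphs whose baselines are within line_tol
    # of the first glyph in the current line share that line.
    lines = []  # each entry is [baseline_y, [glyphs on that line]]
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of page first
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left-to-right reading order
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # Word break detection: a gap wider than a fraction of the
            # previous glyph's width is treated as a space.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

The thresholds would need tuning against real generator output: kerning adjustments must stay below the word gap threshold, and the baseline tolerance must be smaller than the line spacing.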
Do you want to give this a try? bulk_extractor has switches to dump the inflated compressed streams, and then you can write a new recognizer that turns the characters into a text stream.
Thanks @simsong. I added the PDFs. Unfortunately, I don't know C++
This is an easy way to learn!
Do you know Python? I have been planning on doing a Python bridge.
I know some Python