bulk_extractor PDF Scanner Misses Most / All Emails in PDF generated by pandoc

Hi,

I was just testing out bulk_extractor. One of my tests was to create the following text file:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

When I then run bulk_extractor and point it to the text file, I get the expected output in email.txt:

# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.0.0
# Feature-Recorder: email
# Filename: /tmp/1.txt
# Feature-File-Version: 1.1
0       [email protected]  [email protected]\015\[email protected]
16      [email protected]  [email protected]\015\[email protected]\015\[email protected]\015
32      [email protected]   [email protected]\015\[email protected]\015\[email protected]
47      [email protected]  \[email protected]\015\[email protected]\015\[email protected]
63      [email protected]  [email protected]\015\[email protected]\015\012

I then converted the text file to a PDF using pandoc (pdflatex) and opened it in a PDF reader, where I can clearly see the email addresses (on a single line with spaces between them), as shown here:

Screenshot from 2022-09-19 07-46-43

and here is the related PDF: pandoc.pdf

Now I only get this when I run bulk_extractor:

# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.0.0
# Feature-Recorder: email
# Filename: emails.pdf
# Feature-File-Version: 1.1
69-PDF-35       [email protected]   e f@go ogle.com [email protected] ab c@go ogle.co

Finally, I opened the text file in Firefox, selected Print to PDF, and opened the resulting file in a PDF reader, which showed the expected text:

Screenshot from 2022-09-19 07-48-46

and here is the related PDF: firefox.pdf

However, now when I run bulk_extractor on the generated PDF, email.txt is empty. Is this expected behavior? Am I missing something? Thanks

Sep 18 '22 21:09 mthbrown

Hi. Thank you for submitting the bug report. Would it be possible for you to attach the two PDFs to this ticket?

It turns out that the bulk_extractor PDF to text program does not work the way that most PDF to text programs work, as it is designed to work with fragmented files. Instead of going to the end of the PDF file, reading a table, going to each page, creating the objects, and then interpreting the objects, scan_pdf looks for patterns within the inflated compressed streams and applies some simple heuristics. The heuristics were based on analysis of PDF files in the 2008-2014 time period. But the way the PDFs are created from text changes over time. bulk_extractor was not designed for pdflatex or for Firefox PDF generators. It was designed for Microsoft Word on the Mac and Windows.
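To make the difference concrete, here is a minimal Python sketch of the general idea: find anything that inflates, then pull the literal strings out of it. This is not the scan_pdf code; `inflate_streams` and `naive_text` are hypothetical names, and joining fragments with a single space is an assumption chosen only to show how a kerned `TJ` array could produce output like the pandoc case above.

```python
import re
import zlib

STREAM_START = re.compile(rb"stream\r?\n")
# Literal strings such as (abc) in a content stream. Handles backslash
# escapes but not nested parentheses; good enough for a sketch.
PDF_STRING = re.compile(rb"\(((?:\\.|[^\\()])*)\)")


def inflate_streams(data: bytes):
    """Yield every zlib stream that inflates, wherever it sits in data."""
    for match in STREAM_START.finditer(data):
        inflater = zlib.decompressobj()
        try:
            # decompressobj stops at the end of the zlib stream and leaves
            # the rest in unused_data, so no /Length or xref table is needed.
            inflated = inflater.decompress(data[match.end():])
        except zlib.error:
            continue  # not Flate data (this also skips "endstream" matches)
        if inflated:
            yield inflated


def naive_text(content: bytes) -> str:
    """Join the literal strings of a content stream with single spaces."""
    parts = [m.group(1).decode("latin-1") for m in PDF_STRING.finditer(content)]
    return " ".join(parts)
```

If a generator splits a word at a kerning adjustment, for example `[(ab)-333(c@go)28(ogle.com)] TJ`, this kind of joining returns `ab c@go ogle.com`, and an email pattern no longer matches the whole address.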

A better heuristic would be to take all of the (x,y) locations of the text, drop them into a frame buffer, and then run OCR on the frame buffer. You wouldn't need to do full OCR because you already know what the letters are. You would need to do line and word break detection. You need to find the lines so you know the order to send the characters, and you need the word break because there are no spaces encoded in PDF files.
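A rough Python sketch of the line and word break step, assuming the glyph positions have already been recovered from the content stream. The `Glyph` type, the `glyphs_to_text` name, and both thresholds (`line_tol`, `gap_ratio`) are made up for illustration and would need tuning against real PDFs.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    x: float      # left edge of the character on the page
    y: float      # baseline of the character
    width: float  # advance width of the character
    char: str


def glyphs_to_text(glyphs, line_tol=2.0, gap_ratio=0.3):
    """Rebuild text from positioned glyphs via line and word break detection."""
    lines = []  # each entry is a list of glyphs sharing roughly one baseline
    # PDF y grows upward, so descending y reads the page top to bottom.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right within the line
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A horizontal gap wider than a fraction of the previous glyph
            # is treated as a word break; smaller gaps are just kerning.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

The point of working from positions is that small kerning offsets inside a word fall under the gap threshold, so an address stays in one piece, while the space between two addresses still produces a break.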

Do you want to give this a try? bulk_extractor has switches to dump the inflated compressed streams, and then you can write a new recognizer that turns the characters into a text stream.

Sep 18 '22 22:09 simsong

Thanks @simsong. I added the PDFs. Unfortunately, I don't know C++

Sep 19 '22 00:09 mthbrown

This is an easy way to learn!

Do you know Python? I have been planning on doing a Python bridge.

Sep 19 '22 00:09 simsong

I know some Python

Sep 19 '22 01:09 mthbrown
