bulk_extractor needs a quoted-printable decoder

Open Donovoi opened this issue 3 years ago • 12 comments

Hi, apologies if you have already addressed this.

I ran bulk_extractor version 2.0.1 against a Windows 7 raw dd image containing user data as well as the base Windows system.

I noticed that in the email histogram it produced, one of the emails was shown as [email protected] (I've removed the domain so I don't get in trouble).

I used X-Ways to search for this email address in the data, and the only hits I could find are .eml files showing <a [email protected]. I believe this is not the actual address; the "3D" prefix comes from quoted-printable encoding. Here is some more for context:

<A [email protected] =
href=3D"mailto:[email protected]">Klaus non=20
  Redacted</A> </DIV>
  <DIV style=3D"FONT: 10pt arial">

Here is where I learnt about it: https://stackoverflow.com/a/4016098
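For readers unfamiliar with the encoding, here is a minimal sketch of a quoted-printable decoder following RFC 2045, section 6.7: "=3D" is an escaped "=", "=20" is an escaped space, and "=" at the end of a line is a soft line break. The function names are hypothetical and this is not bulk_extractor code; the address in the usage note is a placeholder.

```cpp
#include <cstddef>
#include <string>

// Convert one hex digit to its value, or -1 if the character is not hex.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;  // lenient: RFC 2045 wants uppercase
    return -1;
}

// Decode quoted-printable text (hypothetical helper, not a bulk_extractor API).
// "=XX" becomes the byte 0xXX, "=" followed by a line break is dropped,
// and every other byte is copied through unchanged.
std::string decode_quoted_printable(const std::string &in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '=') {
            out.push_back(in[i]);
            continue;
        }
        // Soft line break: "=\r\n" or "=\n".
        if (i + 2 < in.size() && in[i + 1] == '\r' && in[i + 2] == '\n') { i += 2; continue; }
        if (i + 1 < in.size() && in[i + 1] == '\n') { i += 1; continue; }
        // Escaped byte: "=" followed by two hex digits.
        if (i + 2 < in.size()) {
            const int hi = hex_value(in[i + 1]);
            const int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                continue;
            }
        }
        out.push_back('=');  // a stray "=" is left as-is
    }
    return out;
}
```

Passed the text href=3D"mailto:user@example.com", this returns href="mailto:user@example.com", so an email scanner run over the decoded text would no longer see a "3D" prefix in that attribute.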

Let me know if this has already been addressed or if you need more info.

Thank you for your work!

Donovoi, Jun 29 '22 11:06

Thanks for the posting. I changed the title to reflect what you need.

bulk_extractor doesn't parse files. It looks at bulk data. The problem you have here is that there is quoted-printable material that is not being decoded. In your example, it looks like the program would recover both [email protected] and [email protected]. Can you verify that it did?

If we add a quoted-printable decoder, it will then recover 3 email addresses in your example, because [email protected] will parse as both [email protected] and then, with a longer forensic path, [email protected]. You could remove the first with post-processing.

One question: you say that the program found [email protected]. However, given your example, it should have found [email protected]. Can you check this for me?

simsong, Jun 29 '22 12:06

Hi @simsong thanks for your reply. Yes I can confirm it recovered both email addresses.

Sorry, I should have been clearer. It did find 3D, but I believe there is a flag that makes all emails in the histogram lowercase: https://github.com/simsong/bulk_extractor/blob/17c2a0d52d67f3dd9bb46f62ed8678c6e48cf525/src/scan_email_lg.cpp#L232

I could be wrong; I'm not a C++ programmer.

Donovoi, Jun 29 '22 12:06

You are correct. In the histogram the emails are lowercased. Do you think that a quoted-printable decoder is worth doing? It's not hard. Do you want to become a C++ programmer?
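To illustrate why "3D" shows up as "3d" in the histogram, here is a small sketch of a case-folding histogram. It is an assumption about the general technique, not the code in scan_email_lg.cpp; the function names and addresses are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Lowercase an ASCII string (hypothetical helper).
std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Count features after lowercasing, so differently cased spellings of the
// same address share one histogram bucket.
std::map<std::string, int> lowercase_histogram(const std::vector<std::string> &features) {
    std::map<std::string, int> counts;
    for (const auto &f : features) {
        ++counts[to_lower_ascii(f)];
    }
    return counts;
}
```

With this approach, a feature recorded as 3DKlaus@example.com is counted under the key 3dklaus@example.com, which matches what the histogram in this report showed.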

simsong, Jun 29 '22 12:06

Haha! I would love to! Thank you for the opportunity!

Don't expect anything on par with your work, but I can give it a go :)

Donovoi, Jun 29 '22 12:06

It's far easier to develop on Linux or Mac than Windows. Are you okay with that?

simsong, Jun 29 '22 12:06

Great. Why don't you try to build under the current Fedora? If you can spin up a VM and build it, I can then give you step-by-step instructions on how to develop the quoted-printable decoder. We won't have a method for the decoder to suppress the false positive, but it may be useful for other purposes. And you'll learn something!

simsong, Jun 29 '22 13:06

Woohoo!

Well, I've just created a Fedora 36 Workstation instance inside QEMU/KVM (which is inside WSL2), and it is working well.

I'll wait to hear from you regarding next steps.

Thanks again!

Donovoi, Jun 30 '22 08:06

Great. You need to do a git clone --recursive on this repo, run the script in the etc directory, and then verify that you can build and execute the self-tests. If you need help with this, let me know, and I'll develop a readme with you in the repo. We will then expand the readme so that people learn how to develop new modules. Sound cool?

simsong, Jun 30 '22 17:06

Sounds great!

I have successfully run the tests via regress.py.

It does say that some features were not found:

Now reading features from data_check.txt
b'Data/Base64_files/EmailText/RADIX64\xf4\x80\x80\x9c-0-BASE64-2370-ZIP-0-MSXML-2' not found b'[email protected]'
b'Data/Base64_files/EmailText/RFC1421\xf4\x80\x80\x9c-106-BASE64-2322-ZIP-1213' not found b'[email protected]'
b'Data/Base64_files/EmailText/RFC1642\xf4\x80\x80\x9c-0-BASE64-2370-ZIP-1213' not found b'[email protected]'
b'Data/Base64_files/EmailText/RFC2045\xf4\x80\x80\x9c-0-BASE64-2370-ZIP-0-MSXML-2' not found b'[email protected]'
b'Data/Base64_files/EmailText/RFC3548\xf4\x80\x80\x9c-0-BASE64-2423-ZIP-0-MSXML-30' not found b'[email protected]'
b'Data/Base64_files/JEPG/RFC 1421\xf4\x80\x80\x9c-0-BASE64-0' not found b'057b7e3d9e7a3a3db3e147a6ce16e786'
Total features found: 66
Total features not found: 6

But everything else seems to work as expected.

Donovoi, Jul 01 '22 05:07

Sorry for the delay in getting back to you. I've been dealing with a server-down situation on simson.net.

Anyway, regress.py is a version 1.0 system. The test for version 2.0 is bin/test_be, which runs all of the unit tests. But it looks like you've got this working.

Congrats!

Now the thing to do is to create a branch with git. Let's call it dev-quote-printable. I can add you as a contributor to this repo, or you can fork and do your own.

Have you read the bulk_extractor programmer's manual? I haven't compiled it in a while. Probably the best way for us to do this would be for you to read the manual and then put questions in it, and I'll answer them. In this way the manual will get better.

So here's what you need to do:

  1. Create a scan_quoteprintable.cpp file based on one of the other scanners and hook it into the autoconf system. Your first version of the scanner should not do anything but init and deinit and register its metadata.
  2. Add scan_quoteprintable to bulk_extractor_scanners.h. (This is new with version 2.0 and the programmer's manual needs to be updated.)
  3. Run bulk_extractor and verify that your scanner appears in the scanner list.
  4. Now you need to make your scanner recognize quoted-printable and unquote it. You will do this by scanning the sbuf, looking for quoted-printable escapes, and writing the decoded bytes to a stringstream. Once you catch a certain number of them, you'll make an sbuf with the stringstream and execute a recursive call. I can show you where this happens, and it should be properly documented.
  5. Now you need to create a unit test.
  6. Finally, we need to think about how to suppress false positives. That's more art than science.
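Step 4 can be sketched roughly as follows. This is only an illustration using std::string in place of an sbuf: the function name, the min_escapes threshold, and the idea of returning the decoded text to the caller (rather than making the recursive call) are all assumptions, not the bulk_extractor API.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Value of a hex digit, or -1 if the character is not a hex digit.
static int qp_hex(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Scan buf for quoted-printable escapes, writing decoded bytes to a
// stringstream. Returns true and fills `decoded` only if at least
// `min_escapes` escapes or soft line breaks were seen; otherwise the buffer
// is treated as not quoted-printable and `decoded` is left untouched.
// In a real scanner, `decoded` would be wrapped in a new sbuf and rescanned.
bool scan_for_quoted_printable(const std::string &buf, std::size_t min_escapes,
                               std::string &decoded) {
    std::ostringstream ss;
    std::size_t escapes = 0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] == '=') {
            const std::size_t left = buf.size() - i - 1;  // bytes after '='
            if (left >= 2 && qp_hex(buf[i + 1]) >= 0 && qp_hex(buf[i + 2]) >= 0) {
                ss.put(static_cast<char>(qp_hex(buf[i + 1]) * 16 + qp_hex(buf[i + 2])));
                ++escapes;
                i += 2;
                continue;
            }
            if (left >= 2 && buf[i + 1] == '\r' && buf[i + 2] == '\n') {
                ++escapes;  // soft line break "=\r\n"
                i += 2;
                continue;
            }
            if (left >= 1 && buf[i + 1] == '\n') {
                ++escapes;  // soft line break "=\n"
                i += 1;
                continue;
            }
        }
        ss.put(buf[i]);
    }
    if (escapes < min_escapes) return false;
    decoded = ss.str();
    return true;
}
```

The threshold is one crude way to approach step 6: a lone "=41" in binary data is probably a coincidence, while several escapes close together are more likely to be real quoted-printable text.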

We might also want to create a new hook in the feature recorder so that passthrough features are automatically discarded. That is, these two features are probably the same and the second should not be reported:

1234567    [email protected]
1234000-QUOTEPRINTABLE-467 [email protected]
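One possible shape for such a filter is sketched below as a post-processing pass. The Feature struct, the function name, and the rule itself (drop a decoded feature when the identical text was also reported on a path without the decoder) are assumptions for illustration. A real hook would probably also compare offsets, which this sketch does not do.

```cpp
#include <set>
#include <string>
#include <vector>

// A reported feature: forensic path plus the feature text (hypothetical type).
struct Feature {
    std::string path;
    std::string text;
};

// True if the forensic path went through the (hypothetical) decoder.
static bool via_quoteprintable(const Feature &f) {
    return f.path.find("-QUOTEPRINTABLE-") != std::string::npos;
}

// Drop features found via the decoder when the same text was already
// reported without it, since decoding added nothing in that case.
std::vector<Feature> drop_passthrough(const std::vector<Feature> &in) {
    std::set<std::string> seen_plain;
    for (const auto &f : in) {
        if (!via_quoteprintable(f)) seen_plain.insert(f.text);
    }
    std::vector<Feature> out;
    for (const auto &f : in) {
        if (via_quoteprintable(f) && seen_plain.count(f.text) > 0) continue;
        out.push_back(f);
    }
    return out;
}
```

Features that only appear after decoding, such as an address that had a "3D" prefix in the raw data, are kept.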

If this sounds like something you can do, I can create the blank scanner to get you going.

simsong, Jul 02 '22 15:07

Thank you for those instructions! I'll have a read of the manual and fork the repo, just so I can make mistakes without them becoming a problem for someone else's blood, sweat, and tears, ha.

This will take a bit of time for me as I'll need to learn a few things and juggle some assignments. But hopefully I will have something of a draft within the week. Not promising anything as something might come up.

I'll be sure to post any questions on the programmer's manual.

Leave it with me 😁

Donovoi, Jul 03 '22 18:07