bbot New internal module "extract"

This Draft PR adds an internal module "extract" which will contain several functions that can extract certain file types into folders ready for excavate to pull out useful information such as URLs, DNS_NAMEs etc.

Nov 04 '24 11:11 domwhewell-sage

Nice! This will be a fun one to build out, as we add support for every compression type and enable recursive extraction (archives within archives).

I wrote code a while back to do this in credshed, which might be useful:

Nov 04 '24 14:11 TheTechromancer

I like the mapping of compression types to extraction functions. Probably we'll need to improve on our magic filetype detection, especially get_compression(). This will keep us from relying on extensions, since there are lots of cases e.g. where you can have a zip file with a non-zip extension.
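As a rough illustration of detection that ignores the file extension, here is a minimal sketch that checks a file's leading "magic" bytes against a few well-known signatures. The name `sniff_compression` and the signature table are assumptions made for this example. This is not bbot's `get_compression()`, and a real implementation would likely lean on libmagic for wider coverage.

```python
# Leading "magic" bytes for a few common formats (well-known signatures).
SIGNATURES = [
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"Rar!\x1a\x07", "rar"),
]


def sniff_compression(path):
    """Return a format name based on the file's first bytes, or None."""
    with open(path, "rb") as f:
        header = f.read(8)
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return None
```

With this approach a zip file saved as `report.docx` is still reported as `zip`, because only the content is inspected.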

Also we might want to favor shell commands over python libraries, since CPU resources in the main process are really scarce, and offloading to tools like 7z is an effective way to parallelize.
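To make the offloading idea concrete, here is a minimal sketch that runs external commands as child processes from an asyncio event loop, so the main process only waits on them. The helper names `run_command` and `run_all` and the `limit` parameter are hypothetical and are not part of bbot.

```python
import asyncio


async def run_command(cmd):
    """Run cmd in a child process without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    return proc.returncode, stdout.decode(), stderr.decode()


async def run_all(commands, limit=4):
    """Run several commands concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(cmd):
        async with semaphore:
            return await run_command(cmd)

    return await asyncio.gather(*(guarded(c) for c in commands))
```

Each entry in `commands` would be an argument list such as the 7z or tar invocations in the mapping. The semaphore caps how many extractors run at once.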

I wrote a system just like this in credshed, where each file would get extracted, and then its contents recursively searched for more compressed files, which would each get extracted to an auto-named folder (e.g. <file_name>.extracted):

import os
import magic
import logging
import subprocess as sp
from pathlib import Path

log = logging.getLogger('credshed.filestore.util')

supported_compressions = [
    ('microsoft excel', ['ssconvert', '-S', '{filename}', '{extract_dir}/%s.csv']),
    ('rar archive', ['unrar', 'x', '-o+', '-p-', '{filename}', '{extract_dir}/']),
    ('tar archive', ['tar', '--overwrite', '-xvf', '{filename}', '-C', '{extract_dir}/']),
    ('gzip compressed', ['tar', '--overwrite', '-xvzf', '{filename}', '-C', '{extract_dir}/']),
    ('gzip compressed', ['gunzip', '--force', '--keep', '{filename}']),
    ('bzip2 compressed', ['tar', '--overwrite', '-xvjf', '{filename}', '-C', '{extract_dir}/']),
    ('xz compressed', ['tar', '--overwrite', '-xvJf', '{filename}', '-C', '{extract_dir}/']),
    ('lzma compressed', ['tar', '--overwrite', '--lzma', '-xvf', '{filename}', '-C', '{extract_dir}/']),
    ('7-zip archive', ['7z', 'x', '-p""', '-aoa', '{filename}', '-o{extract_dir}/']),
    ('zip archive', ['7z', 'x', '-p""', '-aoa', '{filename}', '-o{extract_dir}/']),
]

def extract_file(file_path, extract_dir=None):
    file_path = Path(file_path).resolve()
    if extract_dir is None:
        extract_dir = file_path.with_suffix('.extracted')
    extract_dir = Path(extract_dir).resolve()

    # Create the extraction directory if it doesn't exist
    if not extract_dir.exists():
        extract_dir.mkdir(parents=True, exist_ok=True)

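    # Determine the file type using magic. Use the human-readable description
    # (not the MIME type) so it matches the strings in supported_compressions.
    file_type = magic.from_file(str(file_path)).lower()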

    # Find the appropriate decompression command
    for magic_type, cmd_list in supported_compressions:
        if magic_type in file_type:
            log.info(f'Compression type "{magic_type}" detected in {file_path}')
            cmd_list = [s.format(filename=file_path, extract_dir=extract_dir) for s in cmd_list]
            log.info(f'>> {" ".join(cmd_list)}')
            try:
                sp.run(cmd_list, check=True)
                log.info(f'Decompression successful for {file_path}')
                # Recursively extract files in the new directory
                for item in extract_dir.iterdir():
                    if item.is_file() and is_compressed(item):
                        extract_file(item, extract_dir / item.stem)
                return True
            except sp.SubprocessError as e:
                log.error(f'Error extracting file {file_path}: {e}')
                # Try the next matching command (e.g. gunzip after tar for gzip)
                continue
    log.warning(f'No working extraction command found for {file_path}')
    return False

def is_compressed(file_path):
    # Match against libmagic's description string rather than its MIME type
    file_type = magic.from_file(str(file_path)).lower()
    return any(magic_type in file_type for magic_type, _ in supported_compressions)
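The recursive part of the snippet above can be tried without any external tools. The following is a minimal sketch using only the standard library. It assumes archives can be recognized by their extension (which is what `shutil.unpack_archive` does, unlike the magic-based detection discussed here), and the names `extract_recursive`, `is_known_archive` and `max_depth` are hypothetical.

```python
import shutil
from pathlib import Path


def is_known_archive(path):
    """True if the extension is one shutil knows how to unpack."""
    name = str(path).lower()
    return any(
        name.endswith(ext)
        for _, extensions, _ in shutil.get_unpack_formats()
        for ext in extensions
    )


def extract_recursive(archive, max_depth=3):
    """Unpack archive into <name>.extracted, then unpack archives inside it."""
    archive = Path(archive)
    if not is_known_archive(archive):
        return False
    # Append the suffix so "a.tar.gz" becomes "a.tar.gz.extracted"
    dest = archive.with_name(archive.name + ".extracted")
    dest.mkdir(parents=True, exist_ok=True)
    try:
        shutil.unpack_archive(str(archive), str(dest))
    except shutil.ReadError:
        return False  # extension matched but the content could not be read
    if max_depth > 0:
        # Snapshot the listing first, since nested extraction adds new entries
        for item in list(dest.rglob("*")):
            if item.is_file():
                extract_recursive(item, max_depth - 1)
    return True
```

The depth limit is a guard against deeply nested or self-referencing archives. The value 3 is arbitrary.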

Nov 29 '24 22:11 TheTechromancer

Marked this as ready for review now. This should be a good base for extracting the most popular compression types. I have also removed the jadx-compatible compression types from libmagic, so that jadx extracts them instead of this module.

Dec 08 '24 16:12 domwhewell-sage

@domwhewell-sage thanks for your work on this. It's looking good!

A few things:

- For the .jar and .apk exclusions, we should probably hardcode those into the module instead of the helper.
- The module needs either the safe or aggressive tag to pass the tests (it's safe).
- We should probably have tests for:
  - ~archive within archive (e.g. a .tar.gz inside a .7z)~
  - archive within .jar/.apk (to test its interaction with the other modules)
- What are your thoughts on naming the module unarchive or uncompress? I think maybe extract is a little too close to excavate, since it can have a dual meaning.
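For the .jar/.apk point, hardcoding the exclusions in the module could look roughly like the sketch below. `EXCLUDED_EXTENSIONS` and `should_unarchive` are hypothetical names for illustration only. They are not bbot's module API, and the real module may filter on event data rather than file paths.

```python
from pathlib import Path

# Hypothetical constant: file types left for jadx to handle
EXCLUDED_EXTENSIONS = {".jar", ".apk"}


def should_unarchive(path, excluded=EXCLUDED_EXTENSIONS):
    """Return False for files that another module is expected to extract."""
    return Path(path).suffix.lower() not in excluded
```

Keeping the list in the module means the shared filetype helper still reports these files accurately for every other caller.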

Dec 09 '24 21:12 TheTechromancer

Hi @TheTechromancer, I have addressed all the comments except the tests for archives in .jar/.apk files, as the module is currently made to handle specific archive files recursively. I would need to think about how it handles folders output by jadx (so it doesn't consume its own events).

Also, the tests keep failing because the apt dependencies aren't getting installed for them for some reason. Is there an apt_deps that I can define for the tests?

Dec 18 '24 18:12 domwhewell-sage

> Also, the tests keep failing because the apt dependencies aren't getting installed for them for some reason. Is there an apt_deps that I can define for the tests?

I'll add those to the core deps.

Dec 18 '24 18:12 TheTechromancer

@domwhewell-sage https://github.com/blacklanternsecurity/bbot/pull/2096 has been merged so you should be okay to remove deps_apt.

Dec 20 '24 18:12 TheTechromancer

The tests are failing because of these commands which are being executed in the class definition:

The solution should be to move them into the setup function (and preferably asyncify them):

    async def setup_after_prep(self, module_test):
        # Run the commands asynchronously
        for command in self.commands:
            process = await asyncio.create_subprocess_exec(
                *command,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE
            )
            stdout, stderr = await process.communicate()
            assert process.returncode == 0, f"Command {command} failed with error: {stderr.decode()}"
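The reason the class-level commands break is that a class body runs as soon as the class statement is evaluated, which happens at import time, before any test setup. A minimal sketch of the difference follows. The class names and the `run` stand-in are made up for this example and are not bbot's test classes.

```python
calls = []


def run(label):
    """Stand-in for a real command such as sp.run([...])."""
    calls.append(label)


class EagerTest:
    # Executed while the class is being defined, i.e. when the file is imported
    run("eager")


class DeferredTest:
    commands = ["deferred"]

    def setup(self):
        # Executed only when the test harness calls setup()
        for label in self.commands:
            run(label)
```

After the definitions alone, `calls` holds only `"eager"`. The deferred command is recorded once `DeferredTest().setup()` is called.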

Dec 20 '24 19:12 TheTechromancer

Thanks! A classic case of "It worked on my machine" haha

Dec 20 '24 20:12 domwhewell-sage

/sigh 🙄 It seems Debian, Arch and Fedora all lack a package for 'rar', which is used to create the .rar test file

Debian

"stdout": "fatal: [localhost]: FAILED! => {\"changed\": false, \"msg\": \"No package matching 'rar' is available\"}",

Arch

"stdout": "fatal: [localhost]: FAILED! => {\"changed\": false, \"cmd\": [\"/usr/sbin/pacman\", \"--upgrade\", \"--print-format\", \"%n\", \"rar\"], \"msg\": \"Failed to list package rar\", \"rc\": 1, \"stderr\": \"error: 'rar': could not find or read package\\n\", \"stderr_lines\": [\"error: 'rar': could not find or read package\"]}",

Fedora

"stdout": "fatal: [localhost]: FAILED! => {\"changed\": false, \"failures\": [\"No package rar available.\"], \"msg\": \"Failed to install some of the specified packages\", \"rc\": 1}",

Dec 31 '24 18:12 domwhewell-sage

> /sigh 🙄 It seems Debian, Arch and Fedora all lack a package for 'rar', which is used to create the .rar test file

Oof yeah I think the problem here is that rar is technically proprietary, so we can decompress the files but not create them. For rar specifically we'll need to attach the file to the tests.

Dec 31 '24 19:12 TheTechromancer

Codecov Report

Attention: Patch coverage is 95.65217% with 7 lines in your changes missing coverage. Please review.

Project coverage is 93%. Comparing base (b33e384) to head (2548289). Report is 6 commits behind head on dev.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bbot/modules/internal/unarchive.py | 88% | 6 Missing :warning: |
| bbot/modules/trufflehog.py | 50% | 1 Missing :warning: |

Additional details and impacted files

@@          Coverage Diff           @@
##             dev   #1918    +/-   ##
======================================
- Coverage     93%     93%    -0%     
======================================
  Files        372     374     +2     
  Lines      28936   29098   +162     
======================================
+ Hits       26735   26873   +138     
- Misses      2201    2225    +24

Jan 12 '25 17:01 codecov[bot]

Damn, unrar-free is too old on Arch Linux to extract .rar files, so I'm changing it to 7z, as it can open v4 rar files. But the 7z on Fedora isn't compatible with rar files.

And the compression type isn't being set on the lzma file on Arch Linux /sigh
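One way to cope with distro differences like the rar case is to pick whichever extractor is installed. This is a minimal sketch under that assumption. `RAR_EXTRACTORS` and `pick_extractor` are hypothetical names, and the flags are copied from the credshed mapping earlier in the thread. It only checks that a binary exists, so it would not catch the case described here where the tool is present but cannot read the format.

```python
import shutil

# Candidate commands in order of preference; {archive} and {dest} are
# placeholders to be filled in with str.format() before running.
RAR_EXTRACTORS = [
    ["unrar", "x", "-o+", "-p-", "{archive}", "{dest}/"],
    ["7z", "x", "-aoa", "{archive}", "-o{dest}/"],
]


def pick_extractor(candidates, which=shutil.which):
    """Return the first command whose executable is on PATH, or None."""
    for cmd in candidates:
        if which(cmd[0]) is not None:
            return cmd
    return None
```

The `which` parameter exists so the lookup can be swapped out in tests without touching the real PATH.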

Jan 13 '25 20:01 domwhewell-sage

Hot take: what if we just comment out rar and figure it out later?

Jan 13 '25 21:01 TheTechromancer