unblob
Metadata file
Store the metadata in extract_root as a single JSON file.
We don't want to pollute the extracted folder with lots of small files.
It should also be easy to read, and a single JSON file is easy to look at.
For example:
from typing import Optional

import attr


@attr.define
class Metadata:
    filename: Optional[str] = None
    size: Optional[int] = None
    perms: Optional[int] = None
    endianness: Optional[str] = None
    uid: Optional[int] = None
    username: Optional[str] = None
    gid: Optional[int] = None
    groupname: Optional[str] = None
    inode: Optional[int] = None
    vnode: Optional[int] = None


@attr.define
class Chunk:
    """Chunk of a Blob; has start and end offsets, but can still be invalid."""

    start_offset: int
    # This is the last byte included
    end_offset: int
    handler: "Handler" = attr.ib(init=False, eq=False)
    metadata: Optional[Metadata] = None
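For illustration, a hypothetical example of what the single metadata file under extract_root could contain once serialized (field names follow the sketch above; the exact layout is not something decided here):

import json

# Hypothetical metadata.json content: one entry per chunk, serialized from the
# Metadata/Chunk sketch above. Names and nesting are illustrative only.
example = {
    "chunks": [
        {
            "start_offset": 0,
            "end_offset": 4095,
            "handler": "tar",
            "metadata": {
                "filename": "etc/passwd",
                "size": 1024,
                "perms": 0o644,
                "uid": 0,
                "gid": 0,
            },
        }
    ]
}
print(json.dumps(example, indent=2))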
We have a root File, and in case of processing a directory, we have a list of root Files. The two structures below describe what we want to record (a code sketch of both follows the lists).
FilesystemObject
- parent (Chunk, null for root)
- children (list of Chunk, could be zero)
- path
- type (File, Directory, Device, Symlink etc.)
- permission, ownership, timestamp, ACL, etc. (coming from the handler which extracts metadata from the chunk; otherwise left as null)
- magic/mime
- NB: we want to record metadata on "files" that are not written as part of the extraction (eg: char devices from squashfs)
Chunk
- parent (File)
- children (list of Files, could be zero)
- start/end offset
- length
- type (handler)
- tags (e.g. encryption)
- metadata key/values (FIXME)
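A minimal sketch of these two structures with attrs (class and field names, the FsType enum, and the exact types are illustrative assumptions only; the Chunk below just extends the earlier Chunk sketch with the fields from the list):

from enum import Enum
from pathlib import Path
from typing import Dict, List, Optional

import attr


class FsType(Enum):
    FILE = "file"
    DIRECTORY = "directory"
    DEVICE = "device"
    SYMLINK = "symlink"


@attr.define
class FilesystemObject:
    path: Path
    type: FsType
    parent: Optional["Chunk"] = None               # null for the root
    children: List["Chunk"] = attr.ib(factory=list)
    permissions: Optional[int] = None              # filled in by the handler, otherwise null
    uid: Optional[int] = None
    gid: Optional[int] = None
    mtime: Optional[int] = None
    magic: Optional[str] = None
    mime: Optional[str] = None


@attr.define
class Chunk:
    parent: FilesystemObject
    start_offset: int
    end_offset: int                                # last byte included, as above
    handler_name: str                              # the "type (handler)" item
    children: List[FilesystemObject] = attr.ib(factory=list)
    tags: List[str] = attr.ib(factory=list)        # e.g. "encryption"
    metadata: Dict[str, object] = attr.ib(factory=dict)

    @property
    def length(self) -> int:
        return self.end_offset - self.start_offset + 1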
Questions:
- how can we get the metadata? (can we get it from the extractors, are they smart enough?)
- if metadata gathering is expensive, we should probably make those optional
- do we want to store any errors (eg: extraction errors) related to files/chunks and if yes how?
Almost all of the information described above is now part of the reporting feature of unblob.
The information that is missing right now:
- meta-data about files that were not created because we run without elevated privileges (block devices, character devices)
- exact permission, ownership, and timestamp information on every file
I don't think item 1 has a lot of added value right now. Regarding item 2, we already have the structure in place to collect that information. What remains is making sure the extraction phase preserves that information so that we can simply stat the file for the details.
I would take care of item 2 in two steps:
- add permission, ownership, and timestamps to StatReports
- once they are there, spend time making sure we extract or use extractors in a way that preserves that information whenever they can (a sketch of what this can look like for tar follows this list). We already collected that information at https://unblob.org/formats/, and it's something our intern can do :)
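As an illustration of the second step, for command-based extractors this is mostly about the flags we pass. A sketch for tar (tar is just an example here; the applicable flags differ per tool, and --same-owner only has an effect when running as root):

import subprocess

# Hypothetical extractor invocation that keeps mode, ownership, and timestamps
# on the extracted files instead of letting tar apply the current umask/user.
subprocess.run(
    [
        "tar",
        "--extract",
        "--preserve-permissions",   # keep the modes stored in the archive
        "--numeric-owner",          # avoid remapping through local user/group names
        "--same-owner",             # restore uid/gid (effective only as root)
        "--file", "chunk.tar",
        "--directory", "extract_dir",
    ],
    check=True,
)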
On top of that, I would like to add a specific feature to our meta-data collection effort: saving header information. The idea is to have a metadata field as part of our ChunkReports, which is simply a dict where the handler developer can put relevant information, such as parsed headers.
I submitted a PR to dissect.cstruct going into that direction (see https://github.com/fox-it/dissect.cstruct/pull/29).
The idea behind this is to expose metadata to further analysis steps through the unblob report (e.g. a binary analysis toolkit would read the load address and architecture from a uImage chunk to analyze the file extracted from that chunk with the right settings).
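For illustration, such a downstream consumer could look roughly like this (a sketch only: the report file name, the JSON layout, and the uImage metadata keys "load_address" and "arch" are assumptions, not the actual unblob report schema):

import json
from pathlib import Path

# Hypothetical: read an unblob JSON report and pull analysis settings out of the
# metadata recorded by the uImage handler.
report = json.loads(Path("report.json").read_text())

for task_result in report:
    for chunk_report in task_result.get("reports", []):
        if chunk_report.get("handler_name") != "uimage":
            continue
        metadata = chunk_report.get("metadata", {})
        load_address = metadata.get("load_address")  # assumed key name
        arch = metadata.get("arch")                  # assumed key name
        print(f"analyze with arch={arch}, load address={load_address}")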
All of these changes are quite simple to implement since reporting is already there:
diff --git a/unblob/handlers/archive/sevenzip.py b/unblob/handlers/archive/sevenzip.py
index 040b409..de171c5 100644
--- a/unblob/handlers/archive/sevenzip.py
+++ b/unblob/handlers/archive/sevenzip.py
@@ -70,4 +70,8 @@ class SevenZipHandler(StructHandler):
# We read the signature header here to get the offset to the header database
first_db_header = start_offset + len(header) + header.next_header_offset
end_offset = first_db_header + header.next_header_size
- return ValidChunk(start_offset=start_offset, end_offset=end_offset)
+ return ValidChunk(
+ start_offset=start_offset,
+ end_offset=end_offset,
+ metadata=dict(header),
+ )
diff --git a/unblob/models.py b/unblob/models.py
index 2b8431f..d101a08 100644
--- a/unblob/models.py
+++ b/unblob/models.py
@@ -88,6 +88,7 @@ class ValidChunk(Chunk):
handler: "Handler" = attr.ib(init=False, eq=False)
is_encrypted: bool = attr.ib(default=False)
+ metadata: dict = attr.ib(factory=dict)
def extract(self, inpath: Path, outdir: Path):
if self.is_encrypted:
@@ -108,6 +109,7 @@ class ValidChunk(Chunk):
size=self.size,
handler_name=self.handler.NAME,
is_encrypted=self.is_encrypted,
+ metadata=self.metadata,
extraction_reports=extraction_reports,
)
@@ -188,7 +190,7 @@ class _JSONEncoder(json.JSONEncoder):
if isinstance(obj, bytes):
try:
- return obj.decode()
+ return obj.decode("utf-8", errors="surrogateescape")
except UnicodeDecodeError:
return str(obj)
diff --git a/unblob/report.py b/unblob/report.py
index 1b5bed1..acdabaf 100644
--- a/unblob/report.py
+++ b/unblob/report.py
@@ -4,7 +4,7 @@ import stat
import traceback
from enum import Enum
from pathlib import Path
-from typing import List, Optional, Union, final
+from typing import Dict, List, Optional, Union, final
import attr
@@ -116,6 +116,12 @@ class MaliciousSymlinkRemoved(ErrorReport):
class StatReport(Report):
path: Path
size: int
+ ctime: int
+ mtime: int
+ atime: int
+ uid: int
+ gid: int
+ mode: int
is_dir: bool
is_file: bool
is_link: bool
@@ -133,6 +139,12 @@ class StatReport(Report):
return cls(
path=path,
size=st.st_size,
+ ctime=st.st_ctime_ns,
+ mtime=st.st_mtime_ns,
+ atime=st.st_atime_ns,
+ uid=st.st_uid,
+ gid=st.st_gid,
+ mode=st.st_mode,
is_dir=stat.S_ISDIR(mode),
is_file=stat.S_ISREG(mode),
is_link=stat.S_ISLNK(mode),
@@ -181,6 +193,7 @@ class ChunkReport(Report):
end_offset: int
size: int
is_encrypted: bool
+ metadata: Dict
extraction_reports: List[Report]
Please let me know what you think about this approach.
An issue could be that in _extract_chunk, after the extraction is done, we call fix_extracted_directory, which calls fix_permission:
def fix_permission(path: Path):
    if path.is_file():
        path.chmod(0o644)
    elif path.is_dir():
        path.chmod(0o775)
So, by the time we run StatReport.from_path, the permissions have already been changed.
Also, if the extraction is not running as root, the uid/gid will be inaccurate as well.
It could also be problematic when the ownership in the format is stored using names and those names are not present on the system.
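A minimal sketch of one possible way around the ordering problem (only an assumption on my side, not what the code does today: stat the extracted entries before fix_extracted_directory rewrites permissions; the helper and its signature are hypothetical):

from pathlib import Path

from unblob.report import StatReport  # module path taken from the diff above


def collect_stat_reports_first(extract_dir: Path, task_result, fix_extracted_directory) -> None:
    """Hypothetical helper: record mode/uid/gid/timestamps before they get normalized."""
    for path in extract_dir.rglob("*"):
        # capture the metadata while it still reflects the extractor's output
        task_result.add_report(StatReport.from_path(path))
    # only now rewrite permissions to 0o644 / 0o775
    fix_extracted_directory(extract_dir)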
The metadata part looks OK, though I am not sure we want to store the whole header; I would rather try to standardize the stored meta information. We can also store the raw header, though in some cases there are multiple headers, etc.
Had some discussions with @orosam around unblob better preserving/logging/reporting file metadata. Our idea is to create a FUSE layer for the extraction directory, where we could capture metadata, like ownership information, character and block device details and so on.
I like the approach, but can you be a bit more specific? Do you have examples or specific ideas in mind?
> Our idea is to create a FUSE layer for the extraction directory, where we could capture metadata, like ownership information, character and block device details and so on.
I don't understand why this would help. If the format can reproduce this metadata, it is contained in the format itself, which can be parsed and extracted without looking at the extracted files. What am I missing?
Probing question: do we want to eventually replace all extractors with our hand-rolled ones? If so, then this totally makes sense. If we are to outsource extraction to external implementations, I don't want us to familiarize ourselves with each format so intimately that we'd be able to parse out these details. Some extractors have listing commands, but these need to be parsed as well, and may not contain all the details we want to gather.
> I like the approach, but can you be a bit more specific? Do you have examples or specific ideas in mind?
My idea is to have a very thin FUSE driver, executed either outside of unblob or inside it as a thread, that would forward[^1] all operations to the underlying filesystem and record metadata from the interesting ones, like mknod, chown, etc. See the list of available operations here: https://libfuse.github.io/doxygen/structfuse__operations.html. According to my almost non-existent Mac knowledge, the FUSE API is supported there as well.
The complexity of this approach is that we are not using the details stored in the archive/fs image, but the intent of the extractor tools; e.g. if they are incomplete or just do their own thing, diverging from the data format, we miss those details. OTOH it would be trivial to wire up any format which has a well-behaving extractor.
[^1]: Actually, sanitization could take place at this level, e.g. device node creation can be skipped, symlinks validated, and so on.
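A minimal sketch of what such a thin driver could look like, using the fusepy binding (fusepy is an assumption on my side; any FUSE binding would do, and the recording side is reduced to a plain dict):

import os

from fuse import FUSE, Operations  # fusepy


class MetadataCapturingFS(Operations):
    """Passthrough filesystem that records the extractor's intent as metadata."""

    def __init__(self, backing_root: str):
        self.backing_root = backing_root
        self.captured = {}  # path -> recorded metadata

    def _real(self, path: str) -> str:
        return os.path.join(self.backing_root, path.lstrip("/"))

    def chown(self, path, uid, gid):
        # record who the extractor wanted to own the file; no root needed
        self.captured.setdefault(path, {}).update(uid=uid, gid=gid)

    def chmod(self, path, mode):
        self.captured.setdefault(path, {})["mode"] = mode
        os.chmod(self._real(path), mode)

    def mknod(self, path, mode, dev):
        # sanitization can happen here: device nodes are recorded but never created
        self.captured[path] = {"mode": mode, "dev": dev}

    # create/getattr/readdir/mkdir/read/write/symlink/... would simply forward to
    # the backing directory, as in the usual loopback examples.


# Mount over the extraction directory, e.g.:
# FUSE(MetadataCapturingFS("/path/to/backing"), "/path/to/extract_root", foreground=True)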
So if I understand correctly, the FUSE layer would allow any kind of operation, like a fakeroot would. It would save the intent of the operation as metadata (uid, gid, timestamps, mode), and then proceed by doing what unblob is currently doing (setting ownership and permissions so that extraction can continue).
Correct?
Would a FUSE layer interpose itself between the extraction directory and external tools launched as a subprocess, like 7z? Is it possible from an unprivileged perspective?
That would be the idea. Unfortunately, it is a pain[^2] to make it work inside Docker, because it requires access to a kernel facility on the host. Otherwise, it would work for normal users.
An alternative approach we have discussed in the past is to LD_PRELOAD/DYLD_INSERT_LIBRARIES a shim, or use some other introspection method to trace IO calls. Unfortunately, it has its own can of worms, as it may not work for e.g. commands which are statically linked to libc or call syscalls directly (e.g. Go on Linux).
[^2]: It requires passing --cap-add SYS_ADMIN --device /dev/fuse.