unblob
Metadata file
Store the metadata in extract_root as a single JSON file.
We don't want to pollute the extracted folder with lots of small files.
It should also be easy to read, and a single JSON file is easy to look at.
For example:
from typing import Optional

import attr


@attr.define
class Metadata:
    filename: Optional[str] = None
    size: Optional[int] = None
    perms: Optional[int] = None
    endianness: Optional[str] = None
    uid: Optional[int] = None
    username: Optional[str] = None
    gid: Optional[int] = None
    groupname: Optional[str] = None
    inode: Optional[int] = None
    vnode: Optional[int] = None


@attr.define
class Chunk:
    """Chunk of a Blob; has start and end offsets, but can still be invalid."""

    start_offset: int
    # This is the last byte included
    end_offset: int
    handler: "Handler" = attr.ib(init=False, eq=False)
    metadata: Optional[Metadata] = None
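For illustration, a hypothetical example of what the single metadata file under extract_root could contain once serialized (field names follow the sketch above; the exact layout is not something decided here):

import json

# Hypothetical metadata.json content: one entry per chunk, serialized from the
# Metadata/Chunk sketch above. Names and nesting are illustrative only.
example = {
    "chunks": [
        {
            "start_offset": 0,
            "end_offset": 4095,
            "handler": "tar",
            "metadata": {
                "filename": "etc/passwd",
                "size": 1024,
                "perms": 0o644,
                "uid": 0,
                "gid": 0,
            },
        }
    ]
}
print(json.dumps(example, indent=2))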
We have a root File, and in case of processing a directory, we have a list of root Files. The two structures below describe what we want to record (a code sketch of both follows the lists).
FilesystemObject
- parent (Chunk, null for root)
- children (list of Chunk, could be zero)
- path
- type (File, Directory, Device, Symlink etc.)
- permission, ownership, timestamp, ACL, etc. (coming from the handler which extracts metadata from the chunk; otherwise left as null)
- magic/mime
- NB: we want to record metadata on "files" that are not written as part of the extraction (eg: char devices from squashfs)
Chunk
- parent (File)
- children (list of Files, could be zero)
- start/end offset
- length
- type (handler)
- tags (e.g. encryption)
- metadata key/values (FIXME)
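A minimal sketch of these two structures with attrs (class and field names, the FsType enum, and the exact types are illustrative assumptions only; the Chunk below just extends the earlier Chunk sketch with the fields from the list):

from enum import Enum
from pathlib import Path
from typing import Dict, List, Optional

import attr


class FsType(Enum):
    FILE = "file"
    DIRECTORY = "directory"
    DEVICE = "device"
    SYMLINK = "symlink"


@attr.define
class FilesystemObject:
    path: Path
    type: FsType
    parent: Optional["Chunk"] = None               # null for the root
    children: List["Chunk"] = attr.ib(factory=list)
    permissions: Optional[int] = None              # filled in by the handler, otherwise null
    uid: Optional[int] = None
    gid: Optional[int] = None
    mtime: Optional[int] = None
    magic: Optional[str] = None
    mime: Optional[str] = None


@attr.define
class Chunk:
    parent: FilesystemObject
    start_offset: int
    end_offset: int                                # last byte included, as above
    handler_name: str                              # the "type (handler)" item
    children: List[FilesystemObject] = attr.ib(factory=list)
    tags: List[str] = attr.ib(factory=list)        # e.g. "encryption"
    metadata: Dict[str, object] = attr.ib(factory=dict)

    @property
    def length(self) -> int:
        return self.end_offset - self.start_offset + 1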
Questions:
- how can we get the metadata? (can we get it from the extractors, are they smart enough?)
- if metadata gathering is expensive, we should probably make those optional
- do we want to store any errors (eg: extraction errors) related to files/chunks and if yes how?
Almost all of the information described above is now part of the reporting feature of unblob.
The information that is missing right now:
- meta-data about files that were not created because we run without elevated privileges (block devices, character devices)
- exact permission, ownership, and timestamp information on every file
I don't think item 1 has a lot of added value right now. Regarding item 2, we already have the structure in place to collect that information. What remains is making sure the extraction phase preserves that information so that we can simply stat the file for the details.
I would take care of item 2 in two steps:
- add permission, ownership, and timestamps to StatReports
- once they are there, spend time making sure we extract or use extractors in a way that preserves that information whenever they can (a sketch of what this can look like for tar follows this list). We already collected that information at https://unblob.org/formats/, and it's something our intern can do :)
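As an illustration of the second step, for command-based extractors this is mostly about the flags we pass. A sketch for tar (tar is just an example here; the applicable flags differ per tool, and --same-owner only has an effect when running as root):

import subprocess

# Hypothetical extractor invocation that keeps mode, ownership, and timestamps
# on the extracted files instead of letting tar apply the current umask/user.
subprocess.run(
    [
        "tar",
        "--extract",
        "--preserve-permissions",   # keep the modes stored in the archive
        "--numeric-owner",          # avoid remapping through local user/group names
        "--same-owner",             # restore uid/gid (effective only as root)
        "--file", "chunk.tar",
        "--directory", "extract_dir",
    ],
    check=True,
)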
On top of that, I would like to add a specific feature to our meta-data collection effort: saving header information. The idea is to have a metadata field as part of our ChunkReports, which is simply a dict where the handler developer can put relevant information, such as parsed headers.
I submitted a PR to dissect.cstruct going into that direction (see https://github.com/fox-it/dissect.cstruct/pull/29).
The idea behind this is to expose metadata to further analysis steps through the unblob report (e.g. a binary analysis toolkit would read the load address and architecture from a uImage chunk to analyze the file extracted from that chunk with the right settings).
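For illustration, such a downstream consumer could look roughly like this (a sketch only: the report file name, the JSON layout, and the uImage metadata keys "load_address" and "arch" are assumptions, not the actual unblob report schema):

import json
from pathlib import Path

# Hypothetical: read an unblob JSON report and pull analysis settings out of the
# metadata recorded by the uImage handler.
report = json.loads(Path("report.json").read_text())

for task_result in report:
    for chunk_report in task_result.get("reports", []):
        if chunk_report.get("handler_name") != "uimage":
            continue
        metadata = chunk_report.get("metadata", {})
        load_address = metadata.get("load_address")  # assumed key name
        arch = metadata.get("arch")                  # assumed key name
        print(f"analyze with arch={arch}, load address={load_address}")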
All of these changes are quite simple to implement since reporting is already there:
diff --git a/unblob/handlers/archive/sevenzip.py b/unblob/handlers/archive/sevenzip.py
index 040b409..de171c5 100644
--- a/unblob/handlers/archive/sevenzip.py
+++ b/unblob/handlers/archive/sevenzip.py
@@ -70,4 +70,8 @@ class SevenZipHandler(StructHandler):
# We read the signature header here to get the offset to the header database
first_db_header = start_offset + len(header) + header.next_header_offset
end_offset = first_db_header + header.next_header_size
- return ValidChunk(start_offset=start_offset, end_offset=end_offset)
+ return ValidChunk(
+ start_offset=start_offset,
+ end_offset=end_offset,
+ metadata=dict(header),
+ )
diff --git a/unblob/models.py b/unblob/models.py
index 2b8431f..d101a08 100644
--- a/unblob/models.py
+++ b/unblob/models.py
@@ -88,6 +88,7 @@ class ValidChunk(Chunk):
handler: "Handler" = attr.ib(init=False, eq=False)
is_encrypted: bool = attr.ib(default=False)
+ metadata: dict = attr.ib(factory=dict)
def extract(self, inpath: Path, outdir: Path):
if self.is_encrypted:
@@ -108,6 +109,7 @@ class ValidChunk(Chunk):
size=self.size,
handler_name=self.handler.NAME,
is_encrypted=self.is_encrypted,
+ metadata=self.metadata,
extraction_reports=extraction_reports,
)
@@ -188,7 +190,7 @@ class _JSONEncoder(json.JSONEncoder):
if isinstance(obj, bytes):
try:
- return obj.decode()
+ return obj.decode("utf-8", errors="surrogateescape")
except UnicodeDecodeError:
return str(obj)
diff --git a/unblob/report.py b/unblob/report.py
index 1b5bed1..acdabaf 100644
--- a/unblob/report.py
+++ b/unblob/report.py
@@ -4,7 +4,7 @@ import stat
import traceback
from enum import Enum
from pathlib import Path
-from typing import List, Optional, Union, final
+from typing import Dict, List, Optional, Union, final
import attr
@@ -116,6 +116,12 @@ class MaliciousSymlinkRemoved(ErrorReport):
class StatReport(Report):
path: Path
size: int
+ ctime: int
+ mtime: int
+ atime: int
+ uid: int
+ gid: int
+ mode: int
is_dir: bool
is_file: bool
is_link: bool
@@ -133,6 +139,12 @@ class StatReport(Report):
return cls(
path=path,
size=st.st_size,
+ ctime=st.st_ctime_ns,
+ mtime=st.st_mtime_ns,
+ atime=st.st_atime_ns,
+ uid=st.st_uid,
+ gid=st.st_gid,
+ mode=st.st_mode,
is_dir=stat.S_ISDIR(mode),
is_file=stat.S_ISREG(mode),
is_link=stat.S_ISLNK(mode),
@@ -181,6 +193,7 @@ class ChunkReport(Report):
end_offset: int
size: int
is_encrypted: bool
+ metadata: Dict
extraction_reports: List[Report]
Please let me know what you think about this approach.
An issue could be that in _extract_chunk, after the extraction is done, we call fix_extracted_directory, which calls fix_permission:
def fix_permission(path: Path):
    if path.is_file():
        path.chmod(0o644)
    elif path.is_dir():
        path.chmod(0o775)
So, by the time we run StatReport.from_path, the permissions have already been changed.
Also, if the extraction is not running as root, the uid/gid will be inaccurate as well.
It could also be problematic when the ownership in the format is stored using names and those names are not present on the system.
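A minimal sketch of one possible way around the ordering problem (only an assumption on my side, not what the code does today: stat the extracted entries before fix_extracted_directory rewrites permissions; the helper and its signature are hypothetical):

from pathlib import Path

from unblob.report import StatReport  # module path taken from the diff above


def collect_stat_reports_first(extract_dir: Path, task_result, fix_extracted_directory) -> None:
    """Hypothetical helper: record mode/uid/gid/timestamps before they get normalized."""
    for path in extract_dir.rglob("*"):
        # capture the metadata while it still reflects the extractor's output
        task_result.add_report(StatReport.from_path(path))
    # only now rewrite permissions to 0o644 / 0o775
    fix_extracted_directory(extract_dir)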
The metadata part looks OK, though I am not sure we want to store the whole header; I would rather try to standardize the stored meta information. We can also store the raw header, though in some cases there are multiple headers, etc.
Had some discussions with @orosam around unblob better preserving/logging/reporting file metadata. Our idea is to create a FUSE layer for the extraction directory, where we could capture metadata, like ownership information, character and block device details and so on.
I like the approach, but can you be a bit more specific? Do you have examples or specific ideas in mind?
> Our idea is to create a FUSE layer for the extraction directory, where we could capture metadata, like ownership information, character and block device details and so on.
I don't understand why this would help. If the format can reproduce this metadata, it is contained in the format itself, which can be parsed and extracted without looking at the extracted files. What am I missing?
Probing question: do we want to eventually replace all extractors with our hand-rolled ones? If so, then this totally makes sense. If we are to outsource extraction to external implementations, I don't want us to familiarize ourselves with each format so intimately that we'd be able to parse out these details. Some extractors have listing commands, but these need to be parsed as well, and may not contain all the details we want to gather.
> I like the approach, but can you be a bit more specific? Do you have examples or specific ideas in mind?
My idea is to have a very thin FUSE driver, executed either outside of unblob or inside it as a thread, that would forward[^1] all operations to the underlying filesystem and record metadata from the interesting ones, like mknod, chown, etc. See the list of available operations here: https://libfuse.github.io/doxygen/structfuse__operations.html. According to my almost non-existent Mac knowledge, the FUSE API is supported there as well.
The complexity of this approach is that we are not using the details stored in the archive/fs image, but the intent of the extractor tools; e.g. if they are incomplete or just do their own thing, diverging from the data format, we miss those details. OTOH it would be trivial to wire up any format which has a well-behaving extractor.
[^1]: Actually, sanitization could take place at this level, e.g. device node creation can be skipped, symlinks validated, and so on.
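A minimal sketch of what such a thin driver could look like, using the fusepy binding (fusepy is an assumption on my side; any FUSE binding would do, and the recording side is reduced to a plain dict):

import os

from fuse import FUSE, Operations  # fusepy


class MetadataCapturingFS(Operations):
    """Passthrough filesystem that records the extractor's intent as metadata."""

    def __init__(self, backing_root: str):
        self.backing_root = backing_root
        self.captured = {}  # path -> recorded metadata

    def _real(self, path: str) -> str:
        return os.path.join(self.backing_root, path.lstrip("/"))

    def chown(self, path, uid, gid):
        # record who the extractor wanted to own the file; no root needed
        self.captured.setdefault(path, {}).update(uid=uid, gid=gid)

    def chmod(self, path, mode):
        self.captured.setdefault(path, {})["mode"] = mode
        os.chmod(self._real(path), mode)

    def mknod(self, path, mode, dev):
        # sanitization can happen here: device nodes are recorded but never created
        self.captured[path] = {"mode": mode, "dev": dev}

    # create/getattr/readdir/mkdir/read/write/symlink/... would simply forward to
    # the backing directory, as in the usual loopback examples.


# Mount over the extraction directory, e.g.:
# FUSE(MetadataCapturingFS("/path/to/backing"), "/path/to/extract_root", foreground=True)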
So if I understand correctly, the FUSE layer would allow any kind of operation, like a fakeroot would. It would save the intent of the operation as metadata (uid, gid, timestamps, mode), and then proceed by doing what unblob is currently doing (setting ownership and permissions so that extraction can continue).
Correct?
Would a FUSE layer interpose itself between the extraction directory and external tools launched as a subprocess, like 7z? Is it possible from an unprivileged perspective?
That would be the idea. Unfortunately, it is a pain[^2] to make it work inside Docker, because it requires access to a kernel facility on the host. Otherwise, it would work for normal users.
An alternative approach we have discussed in the past is to LD_PRELOAD/DYLD_INSERT_LIBRARIES a shim, or use some other introspection method to trace IO calls. Unfortunately, it has its own can of worms, as it may not work for e.g. commands which are statically linked to libc or call syscalls directly (e.g. Go on Linux).
[^2]: It requires passing --cap-add SYS_ADMIN --device /dev/fuse.