Add/experiment with lazy reading of structures
Ever since the conception of cstruct, we've had the idea to add lazy reading to structures. However, for numerous reasons (laziness itself probably being one of them, but also technical ones), we never fully experimented with it.
With the new architecture of 4.0, I think it's a bit easier to try this out.
The idea is basically as follows: for static structures (so no dynamic fields), we can eagerly read all the bytes (we already do this in the compiler) and store it in the Structure object. Then upon access of a field (struct.field_name), we only parse and store that field value.
Some implementation ideas:
- Upon creation of a statically sized structure in StructureMetaType, change the class to LazyStructure
- LazyStructure has a different __init__, which basically amounts to self.__buf = io.BytesIO(fh.read(len(self)))
- For every field in the structure, generate a @cached_property def field_name and attach it to the class
- This generated property will seek into self.__buf and parse the type of that field (a rough sketch follows below)
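Purely for illustration, a minimal sketch of that idea; the explicit size argument, the attach_lazy_field helper, the precomputed field offsets and the per-type _read hook are assumptions here, not the actual cstruct internals:

import io
from functools import cached_property

class LazyStructure:
    # Sketch: buffer the raw bytes eagerly, but parse each field only on first access.
    def __init__(self, fh, size):
        # Read the statically known number of bytes up front without parsing anything
        self._buf = io.BytesIO(fh.read(size))

def attach_lazy_field(cls, name, offset, field_type):
    # Roughly what StructureMetaType could do for every field of a static structure.
    # field_type._read is an assumed per-type parse hook.
    def getter(self):
        self._buf.seek(offset)
        return field_type._read(self._buf)

    prop = cached_property(getter)
    prop.__set_name__(cls, name)  # required because the property is set outside a class body
    setattr(cls, name, prop)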
We could also make the IO fully lazy (only keep a reference to the passed-in fh inside of LazyStructure), but then we run the risk of it disappearing from underneath us (e.g. the file being closed).
This will only really work with static structures, because we can calculate the offsets of all fields beforehand. Technically you could support dynamic structures by allowing this technique on fields up to and including the first dynamic field, but that's probably a lot more complicated to implement.
I wonder if this will really be any faster due to the extra overhead in parsing. We've optimized the compiler quite a bit already, with optimized struct.unpack calls. With this method we instead will go through the (much slower) _read implementations of each type. The "initial" parsing of a structure will be much faster, yes, but I think you'll quickly lose out if you end up reading every field of the structure anyway.
Anyway, this ticket is for experimentation. I personally think #126 is the safer bet to gain more performance.
I wonder if this will really be any faster due to the extra overhead in parsing. We've optimized the compiler quite a bit already, with optimized struct.unpack calls. With this method we instead will go through the (much slower) _read implementations of each type. The "initial" parsing of a structure will be much faster, yes, but I think you'll quickly lose out if you end up reading every field of the structure anyway.
Anyway, this ticket is for experimentation. I personally think https://github.com/fox-it/dissect.cstruct/issues/126 is the safer bet to gain more performance.
I welcome the idea of a Rust implementation.
At the same time, I think a Rust implementation and lazy parsing complement each other. The biggest win of lazy parsing IMO is not performance per se, but enabling "selective parsing" out of the box, i.e. the user is freed from having to write code on top of cstruct to parse specific substructures (example: write something like extfs.inodes[17].inum).
So I think a Rust implementation would perhaps "enable" lazy parsing, by allowing more complicated parsing logic.
The biggest win of lazy parsing IMO is not performance per se, but enabling "selective parsing" out of the box, i.e. the user is freed from having to write code on top of cstruct to parse specific substructures (example: write something like extfs.inodes[17].inum).
Can you elaborate a little bit more? What's the effective difference and gain here?
For example, in sqllite3.py there are Pages, which come after the header. This fact is not obvious from looking at the cstruct spec, but I inferred it from SqlLite3::raw_page.
With lazy parsing, you could write
struct SqlLite3 {
    Header header
    Page pages[header.page_count]   # This is the syntax I think
}

struct Page {
    PageHeader header
    ...
}
Now, when parsing an sqlite3 file with lazy parsing, we can write
sqlite.pages[i].header.checksum1
without needing the Page class in sqllite3.py anymore, or worrying that the complete file is read.
Moreover, it has been made explicit in the cstruct spec that the pages come right after the header. Anyone looking at the cstruct definition immediately sees the overall structure of the file, not just isolated data blocks.
Caveat: One of the omitted details is that page_count might be zero. We could add "calculated fields", which are expressions over other fields, to potentially deal with that.
I can recurse into Cell afterwards.
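Purely as a sketch of how such a pages[i] access could stay lazy; the fixed element size and the read_elem callback are illustrative assumptions, not existing cstruct APIs:

class LazyArray:
    # Sketch: parse array elements on demand instead of up front.
    def __init__(self, fh, base_offset, count, elem_size, read_elem):
        self._fh = fh
        self._base = base_offset
        self._count = count
        self._elem_size = elem_size  # only works when elements have a static size
        self._read_elem = read_elem  # e.g. something that parses a single Page from a stream
        self._cache = {}

    def __len__(self):
        return self._count

    def __getitem__(self, i):
        if not 0 <= i < self._count:
            raise IndexError(i)
        if i not in self._cache:
            self._fh.seek(self._base + i * self._elem_size)
            self._cache[i] = self._read_elem(self._fh)
        return self._cache[i]

An access like sqlite.pages[i].header.checksum1 would then seek to page i and parse just that page, leaving the rest of the file untouched.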
I have to say I respectfully disagree.
In my experience, such a way of parsing only works well in practice a small fraction of the time. It's mostly a very nice engineering exercise (both writing such a parsing library (I've contributed to one) and actually implementing a parser using one), and you very often end up with a bunch of hacks in the DSL to make it work, which distracts from the original goal of making the format easy to understand. You end up having to learn a second language besides the programming language you're using. I believe one of the strong suits of cstruct is that it's very easy to understand and reason about, since it maps so closely to how structures in C work.
In my opinion, such ways of writing parsers primarily add value in things like hex editors, where you only have to visualize the file structure. When you actually want to do something more complex (i.e. actually use the file structure), you're often better off with more direct control.
If the goal is to be clearer about the file format, I believe the documentation should be improved.
While it would be a nice feature in itself, I don't believe there is much benefit for it in the rest of the Dissect project.
If the goal is to be clearer about the file format, I believe the documentation should be improved.
The goal is indeed to document our knowledge of file formats, but preferably using code, so that the code is the documentation. In my experience, written documentation typically gets out of sync quickly. My interest is not engineering exercises, but capturing and safeguarding the business logic we obtained as a result of painstaking reverse engineering efforts. But yeah, if we can't get to that point, the next best thing would be to write more documentation, especially if the spec of the format is not public.
Continuing this example, I was about to recurse into Cell, but had to backtrack, since it is not readily apparent from the parser that there are multiple types of pages. After reading the Cell class I got this, but the logic for switching on page type is split between the Page and Cell classes (we can fix this, though). Ultimately, I reached for the spec.
You end up having to learn a second language besides the programming language you're using.
Then parser generators are essentially ruled out, and the only technique left is parser combinators, so you have an internal DSL using the host programming language, Python. While escape hatches / hacks are very doable using this approach, I don't think combinators can be made performant enough in Python.
The goal is indeed to document our knowledge of file formats, but preferably using code, so that the code is the documentation. In my experience, written documentation typically gets out of sync quickly.
Luckily most file formats we implement don't change that often 😉.
Then parser generators are essentially ruled out, and the only technique left is parser combinators, so you have an internal DSL using the host programming language, Python. While escape hatches / hacks are very doable using this approach, I don't think combinators can be made performant enough in Python.
Besides not being usable from a performance perspective, this is also what I meant with unusable/unreadable. I've worked with a few "parsers" in "native Python syntax" and in short they all suck. It's a pain to understand how the bytes from my hexdump actually map to fields, and it's unreadable in the sense that I still need to learn a second language (the insane syntax of this parser) even though it's in Python. I think our current approach, even though flawed, is much easier to reason about and visualize.
My interest is not engineering exercises, but capturing and safeguarding the business logic we obtained as a result of painstaking reverse engineering efforts.
I didn't want to imply that you were, sorry if I sounded that way. I meant more that I just don't believe this approach works very well in practice, especially in a field dominated by parsers originating from C or reverse engineering. Most if not all analysts and researchers will find a C-like struct mapping to bytes to be most "natural". Especially the audience that will actually look at how a parser is implemented or wishes to implement one themselves.
To summarize, it's not that I don't like the idea and the goal it would achieve, I just don't think it will work in practice, especially given the combination of the formats we implement and our target audience. Even if we could support this in the existing cstruct syntax, I think the places where we could actually apply this will be few and far between, maybe only some simple binary artifact plugins in dissect.target. But I think most other parsers are too complex for this approach. Add to that the enormous engineering complexity of writing such a parser, and I think our efforts are best spent elsewhere.
For example in SQLite, how are you going to implement WAL? In our current setup we can just return a different byte buffer for the requested page, how will that work if every read is hidden in the parser?
Luckily most file formats we implement don't change that often 😉.
True, but the code that parses them does (assuming that the documentation also references the python code).
Besides not being usable from a performance perspective, this is also what I meant with unusable/unreadable. I've worked with a few "parsers" in "native Python syntax" and in short they all suck. It's a pain to understand how the bytes from my hexdump actually map to fields, and it's unreadable in the sense that I still need to learn a second language (the insane syntax of this parser) even though it's in Python. I think our current approach, even though flawed, is much easier to reason about and visualize.
I can see that Python lacks the expressiveness needed to write declarative code, apart from performance concerns. I looked at the Python Construct library and while I wouldn't use the word insane, it is definitely not elegant or beautiful.
I didn't want to imply that you were, sorry if I sounded that way. I meant more that I just don't believe this approach works very well in practice, especially in a field dominated by parsers originating from C or reverse engineering. Most if not all analysts and researchers will find a C-like struct mapping to bytes to be most "natural". Especially the audience that will actually look at how a parser is implemented or wishes to implement one themselves.
Ok, no worries.
Personally I find the C-like struct mapping to bytes also attractive; my gripes are more with the code that is built op top of the cstruct mappings. Perhaps it is possible to use cstruct instances as the primitive parsers which are to be combined in a declarative way.
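As a rough illustration of that thought, assuming the usual dissect pattern where a cstruct structure can be instantiated directly from a file-like object (c_sqlite3 and header are placeholder names):

def from_cstruct(struct_type):
    # Sketch: lift a cstruct structure into a primitive parser of some combinator framework,
    # where a parser is assumed to be a callable taking a state and returning (value, state).
    def parser(state):
        value = struct_type(state.stream)  # parse directly from the active stream
        return value, state
    return parser

# e.g. parse_header = from_cstruct(c_sqlite3.header)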
To summarize, it's not that I don't like the idea and the goal it would achieve, I just don't think it will work in practice, especially given the combination of the formats we implement and our target audience. Even if we could support this in the existing cstruct syntax, I think the places where we could actually apply this will be few and far between, maybe only some simple binary artifact plugins in dissect.target. But I think most other parsers are too complex for this approach. Add to that the enormous engineering complexity of writing such a parser, and I think our efforts are best spent elsewhere.
Perhaps this is true, at least given the constraints we have.
For example in SQLite, how are you going to implement WAL? In our current setup we can just return a different byte buffer for the requested page, how will that work if every read is hidden in the parser?
I don't know yet, I can try to work that out once I study a bit how WAL is implemented in SQLite.
I can see that Python lacks the expressiveness needed to write declarative code, apart from performance concerns. I looked at the Python Construct library and while I wouldn't use the word insane, it is definitely not elegant or beautiful.
Construct is specifically one of the parsers I've used in the past, it falls apart extremely quickly for anything that's a little bit more complex, and it suffers from all the other points I mentioned before too. There's a reason we made cstruct in the first place 😄.
For example in SQLite, how are you going to implement WAL? In our current setup we can just return a different byte buffer for the requested page, how will that work if every read is hidden in the parser?
I have not studied it in detail yet, but what I grokked so far is that a page is either in the WAL file, or in the main database file.
One idea is to extend the ParserState, which ordinarily has a single stream, with actions that push a new stream, resolved by filename, onto a stack and set it as the active stream in the base ParserState. This could be done using either composition or inheritance (in a functional language you would probably employ monad transformers).
Then the parser which parses a page can still operate on the (active) byte stream in the base ParserState, and you have a top-level parser which works on the extended ParserState and first tries the WAL file, and then the main database file.
The cstruct parsers can probably be easily wrapped to become primitive parsers in the framework.
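A small sketch of that stream-stack idea; the resolve callback and the method names are made up for illustration:

class ParserState:
    # Sketch: base state with a single active byte stream.
    def __init__(self, stream):
        self.stream = stream

class MultiStreamState(ParserState):
    # Sketch: extension that can temporarily switch to another stream (e.g. the WAL file).
    def __init__(self, stream, resolve):
        super().__init__(stream)
        self._resolve = resolve  # assumed callback: filename -> open stream
        self._stack = []

    def push_stream(self, name):
        self._stack.append(self.stream)
        self.stream = self._resolve(name)

    def pop_stream(self):
        self.stream = self._stack.pop()

A page parser then only ever touches state.stream, while the top-level parser decides which stream is currently active.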
But there are more fundamental problems, like:
- How to enter positional information of sub-structures. How would back-tracking work?
- How would you implement monadic comprehensions, i.e. parsers which depend on the output of other parsers, without creating a pyramid of doom of nested lambdas? I have some intuition around coroutines (two-way generator functions) but need to look into it.
(pyramid of doom, we don't want this):
# The "Pyramid of Doom" of nested lambdas
# Each subsequent parser depends on the result of the previous one.
parse_length().bind(lambda length:
    parse_data(length).bind(lambda data:
        parse_checksum(data).bind(lambda checksum:
            return_result((length, data, checksum))
        )
    )
)
I will try to write some code later, TBC.
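For reference, the bind and return_result used in the pyramid above could be sketched roughly like this, assuming a parser wraps a function from state to (value, new_state); this is not an existing library:

class Parser:
    def __init__(self, run):
        self.run = run  # run: state -> (value, new_state), raising on parse failure

    def bind(self, f):
        # Run this parser, then feed its result to f, which returns the next parser
        def step(state):
            value, state = self.run(state)
            return f(value).run(state)
        return Parser(step)

def return_result(value):
    # Lift a plain value into a parser that consumes nothing
    return Parser(lambda state: (value, state))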
Also, keep in mind that our goal is to create Pythonic interfaces for often not-Pythonic file formats or structures. So we will always need to wrap any sort of parsing into comfortable Python code. And since we're a DFIR/research tool, that means individual components of the file format need to be represented in the Python API too.
How would you implement monadic comprehensions? I have some intuition around coroutines
Turned out this was already discovered independently by others. For example, see https://old.arrow-kt.io/docs/patterns/monad_comprehensions/, and https://parsy.readthedocs.io/en/latest/tutorial.html#using-previous-values
The idea is to turn the pyramid of doom
parse_length().bind(lambda length:
    parse_data(length).bind(lambda data:
        parse_checksum(data).bind(lambda checksum:
            return_result((length, data, checksum))
        )
    )
)
into the following coroutine:
@do
def parse_file():
    length = yield parse_length()  # Note that we are yielding parser functions
    data = yield parse_data(length)
    checksum = yield parse_checksum(data)
    return (length, data, checksum)  # Technically this should be boxed in a Parser, but we can leave that to `@do`.
where the generator yields parsers, and the @do decorator turns the generator into a parser.
Not too bad I would say, especially since writing generators is only required for advanced use cases.
Note that this approach is not required when aggregating results from independent parsers.
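For illustration, a minimal sketch of what such a @do decorator could look like, assuming for the sake of the sketch that a parser is a callable taking the current state and returning (value, new_state):

def do(gen_fn):
    # Sketch: turn a generator that yields parsers into a parser itself.
    def combined(state):
        gen = gen_fn()
        try:
            parser = next(gen)                # first yielded parser
            while True:
                value, state = parser(state)  # run it against the current state
                parser = gen.send(value)      # feed the result back, get the next parser
        except StopIteration as stop:
            return stop.value, state          # the generator's return value is the overall result
    return combined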
How to enter positional information of sub-structures. How would back-tracking work?
I suspect that if we constrain ourselves to strictly monotonically increasing stream positions, i.e. you can "skip" ahead but not jump back, we can avoid infinite loops, while backtracking still makes sense.
Construct is specifically one of the parsers I've used in the past, it falls apart extremely quickly for anything that's a little bit more complex, and it suffers from all the other points I mentioned before too. There's a reason we made cstruct in the first place 😄.
There is no need for publication bias; I think it makes sense to add this to the rationale of cstruct so others can learn what has been tried before and did not work out. At least, I did not find this in the documentation.
Also, keep in mind that our goal is to create Pythonic interfaces for often not-Pythonic file formats or structures. So we will always need to wrap any sort of parsing into comfortable Python code. And since we're a DFIR/research tool, that means individual components of the file format need to be represented in the Python API too.
While the native interface would be something like
parse(main_db_stream, parse_logical_page(17)) # Returns None or a Page
you can wrap this into class(es).
The problem I guess is how you can continue parsing from the returned page result.
Perhaps this can be solved by letting parse return a tuple with the parse result and the final parser state, but I have to think about it.
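Something along those lines, sticking to the same sketchy state -> (value, state) model; parse_cells is a hypothetical follow-up parser and ParserState is the one sketched earlier:

def parse(stream, parser):
    # Hypothetical top-level entry point: return both the result and the final
    # parser state, so a caller can continue parsing where the parser left off.
    state = ParserState(stream)
    return parser(state)  # parser: state -> (value, new_state)

page, state = parse(main_db_stream, parse_logical_page(17))
cells, state = parse_cells(page)(state)  # continue from the state returned above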
Maybe you should also take a look at nom: https://github.com/rust-bakery/nom or serde: https://serde.rs.
I think serde operates at a different level of abstraction, where Rust data structures are deserialized from a data format such as JSON or Bincode. Essentially, the data is already parsed from a raw byte stream into a data format, whereas our goal is to parse the raw byte stream.
Nom is like the Rust implementation of what I am aiming for. It is comparable to Haskell's Megaparsec, one of the libraries I myself use for parsing text. It is interesting that Nom's primary focus is binary data. Note that I want to "wrap" cstruct parsers for use with this imaginary Python parser library.
Now the next challenge will be to investigate if Python's limited expressiveness allows for a readable combining of parsers. For instance, Python does not support infix functions.
It does allow operator overloading though. I suppose it is possible to let + sequence parsers, and | denote a choice between two parsers.
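A tiny sketch of what that could look like, again assuming a parser boils down to a function from state to (value, new_state), with None signalling failure:

class Parser:
    # Sketch: wrap a parse function so parsers can be combined with + and |.
    def __init__(self, fn):
        self.fn = fn  # fn: state -> (value, new_state), or None on failure

    def __call__(self, state):
        return self.fn(state)

    def __add__(self, other):
        # p1 + p2: run both in sequence, collect both results
        def seq(state):
            first = self(state)
            if first is None:
                return None
            value1, state = first
            second = other(state)
            if second is None:
                return None
            value2, state = second
            return (value1, value2), state
        return Parser(seq)

    def __or__(self, other):
        # p1 | p2: try p1, fall back to p2 (assumes a failed parser did not advance the state)
        def alt(state):
            return self(state) or other(state)
        return Parser(alt)

The | case is also where the backtracking question above becomes relevant: trying the alternative only makes sense if the failed parser left the stream position untouched.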
It is also a consideration to lean more heavily into the generator approach, which allows writing parsers in a sequential style.
One way to find out is to try to translate one of our binary parsers in terms of imaginary parser combinators; I will come back with an example.