
Following a chain of sectors to merge them in a new stream

Open GreyCat opened this issue 7 years ago • 19 comments

At least the FAT filesystem and Microsoft's CFB files follow the same pattern: to specify file contents, one provides the index of the starting sector a0. A parser must then follow the chain of sectors, as specified in a FAT-like table, i.e.:

  • 1st sector = a0
  • 2nd sector = fat[a0]
  • 3rd sector = fat[fat[a0]]
  • etc

until it meets a certain terminator (like -1 or -2) in the FAT table. After that, if we want to do further parsing on the file contents, we need to reassemble all these individual sectors into one new stream (and probably trim it to the size specified in a separate field somewhere in the directory entry).
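For illustration, here is a rough sketch of that chain walk in plain application code (Python), assuming a raw image buf, a FAT table already read into a list fat, a hypothetical sector size of 512, and 0xffffffff as the "-1" terminator:

SECTOR_SIZE = 512          # assumed sector size
TERMINATOR = 0xffffffff    # the "-1" end-of-chain marker, read as an unsigned u4

def read_chain(buf, fat, first_sector):
    """Follow the FAT chain from `first_sector` and return the concatenated
    sector contents (not yet trimmed to the real file size)."""
    data = bytearray()
    cur = first_sector
    while cur != TERMINATOR:
        data += buf[cur * SECTOR_SIZE:(cur + 1) * SECTOR_SIZE]
        cur = fat[cur]  # the next sector index comes from the FAT table
    return bytes(data)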

The following structure can model most of this behavior, but not all:

seq:
  # Offset to the FAT table, which, for simplicity's sake, consists of 4-byte entries
  - id: ofs_fat
    type: u4
  # A pointer to the first sector
  - id: file_first_sector
    type: sector_ptr
  # Full size that we need to trim file size to
  - id: file_full_size
    type: u4
types:
  fat:
    seq:
      - id: entries
        type: u4
        repeat: eos
  sector_ptr:
    seq:
      - id: current_ptr
        type: u4
    instances:
      body:
        pos: current_ptr * 512
        size: 512
        if: current_ptr != -1
      next:
        pos: _root.ofs_fat + 4 * current_ptr
        type: sector_ptr
        if: current_ptr != -1

This effectively allows accessing file sectors one by one:

parsed.file_first_sector.body # 1st sector contents
parsed.file_first_sector.next.body # 2nd sector contents
parsed.file_first_sector.next.next.body # 3rd sector contents, etc.

However, there is no simple way to unite all these sectors and trim the result to file_full_size in order to continue parsing, except in the app code, e.g.:

data = ''
s = parsed.file_first_sector
while not s.body.nil? do
  data << s.body # append this sector's 512 bytes
  s = s.next     # follow the FAT chain to the next sector
end
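To finish the job, the application would then trim the result to file_full_size and wrap it into a new stream. A rough Python equivalent of the Ruby loop above (assuming the generated Python API exposes the same names) might look like this:

import io
from kaitaistruct import KaitaiStream

data = bytearray()
s = parsed.file_first_sector
while s.body is not None:
    data += s.body
    s = s.next

data = bytes(data[:parsed.file_full_size])  # trim to the real file size
sub_io = KaitaiStream(io.BytesIO(data))
# sub_io can now be handed to whatever type describes the file contents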

Any ideas on what would be the best syntax to do it?

GreyCat avatar Jul 02 '17 12:07 GreyCat

Should we unify this issue with other substream-related issues, e.g. the PNG one (data of multiple IDAT chunks should be concatenated before zlib decompression)?
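For reference, that PNG step currently also has to live in application code; a minimal sketch, assuming hypothetical chunk objects with a type string and raw body bytes:

import zlib

# Only the concatenation of all IDAT bodies forms a valid zlib stream;
# decompressing individual chunks would fail.
idat = b"".join(c.body for c in png.chunks if c.type == "IDAT")
pixels = zlib.decompress(idat)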

I presume it would be better if we could come up with a more universal solution which may be good enough to support file formats we haven't met yet.

Also, it may be worth thinking a little about serialization (but really just a little, as it is out of scope right now): if we come up with multiple ideas, we can compare them from this point of view too.

@GreyCat could we collect all the sub-stream related file formats somewhere (this issue's description, a separate wiki page, etc.)?

Currently this is the list (I'll try to collect them and modify this comment):

  • Microsoft CFB
  • FAT filesystem
  • PNG
  • (registry file?)
  • (TCP?)

koczkatamas avatar Jul 02 '17 16:07 koczkatamas

Substreams, as in #44, are a somewhat different issue, although maybe it's worth discussing this one after completing #44.

GreyCat avatar Jul 05 '17 13:07 GreyCat

Another complex example, from the Ogg specification. Each Ogg page has a list of physical "segments", defined like this:

      - id: len_segments
        type: u1
        repeat: expr
        repeat-expr: num_segments
      - id: segments
        repeat: expr
        repeat-expr: num_segments
        size: len_segments[_index]

Nowadays, with the advent of _index, we can even read them. But the actual software works not on physical "segments" but on logical "packets", which are constructed by joining consecutive segments for as long as their length is 255 (a segment shorter than 255 bytes terminates a packet). That is, a typical Ogg page might contain segment lengths like this:

        [.] 12 = 1       } length 1
        [.] 13 = 1       } length 1
        [.] 14 = 1       } length 1
        [.] 15 = 252     } length 252
        [.] 16 = 255     ⎫
        [.] 17 = 36      ⎭ length 291
        [.] 18 = 255     ⎫
        [.] 19 = 34      ⎭ length 289
        [.] 20 = 255     ⎫
        [.] 21 = 255     ⎪
        [.] 22 = 255     ⎪
        [.] 23 = 61      ⎭ length 826

A packet of exactly 255 bytes is encoded as 2 segments: a 255-byte segment + a 0-byte segment.

To add insult to injury, technically packets can even be split between different Ogg pages (i.e. higher-level structures). This way one page might end with a segment of 255 bytes and another one might start with a segment that continues it (the continuing page carries a "continuation" flag).
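In application code, the joining rule boils down to something like this (a sketch, assuming segments and len_segments are parsed as in the snippet above):

# Reassemble logical packets from physical segments: a 255-byte segment means
# "the packet continues in the next segment"; anything shorter ends the packet.
packets = []
current = bytearray()
for seg, seg_len in zip(page.segments, page.len_segments):
    current += seg
    if seg_len < 255:
        packets.append(bytes(current))
        current = bytearray()
# anything left in `current` is a packet continued on the next Ogg page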

GreyCat avatar Jul 21 '17 09:07 GreyCat

seq:
  .......
chains:
  chain_typed:
    doc: lol
    type: frame # if `chain` is set, `type` is forbidden and inherited from that chain
  chain_stream:
    doc: united stream of bytes ready to be parsed

types:
  ........
  aaaa:
    seq:
      - id: frame # a chain of objects of type `frame`
        chain: _root.chain_typed
      - id: size
        type: u8
      - id: byte_chunk # a chain of raw bytes, may be used for parsing after merging via `_io`
        size: size
        chain: chain_stream

What do you think?

KOLANICH avatar Aug 28 '17 20:08 KOLANICH

Would you care to elaborate a little on how that's supposed to work, so I won't be reinventing the whole thing from the very beginning, trying to guess what you meant here?

GreyCat avatar Aug 28 '17 20:08 GreyCat

chains is the dictionary of chain identifiers. A chain is by definition a collection. Implementation is to be decided; I guess different languages will have different ones. For C++ I guess we need a chain to be a vector, and adding to the chain would be move-semantics operations like emplace_back. For reference types in GC languages I guess it's just a collection. A chain is typed. If type is omitted, it is by definition a collection of raw bytes (with array-like and stream interfaces). A chain has an address, as any property has. In fact it is a property, like any field in seq and any instance, so it has an address.

chain binds a property to its chain by the chain's path. Binding a property to a chain means that KSC adds some code inserting a pointer/reference to (or the actual content of) the recently parsed property into the chain.

KOLANICH avatar Aug 28 '17 20:08 KOLANICH

Before we implement anything concrete, I'd like to see how the suggested solution solves the issues mentioned above (Microsoft CFB, FAT filesystem, PNG, registry file?, TCP?, Ogg, the referenced issue).

Finding a fits-them-all solution is probably not easy.

koczkatamas avatar Aug 28 '17 20:08 koczkatamas

For Ogg, do you mean something like this:

.....
types:
  page:
    chains:
      data: {}

    seq:
      ....
      - id: segments
        repeat: expr
        repeat-expr: num_segments
        size: len_segments[_index]
        doc: Segment content bytes make up the rest of the Ogg page.
        chain: data

The proposed syntax should merge all the page's segments into the data stream belonging to that page.

KOLANICH avatar Aug 30 '17 06:08 KOLANICH

I'm not sure if this is directly related to this enhancement proposal but I've encountered a related issue when attempting to build a struct to describe MPEG-TS protocol captures. MPEG-TS consists of lots of small packets (188 bytes each) which have program identifiers and counters in their headers.

Ideally, it would be possible to use Kaitai to not only split a capture into the 188-byte packets, but also merge the payloads of packets belonging to a given program identifier according to their specified order (i.e. demultiplex) and then parse that reassembled payload with its own Kaitai structure.

Right now, I have to pop out of Kaitai into Python to do this merging process, generate an intermediary binary containing the demultiplexed payloads, and then pop back into Kaitai with a different .ksy to parse this intermediary format. It'd be great to do this all inside Kaitai for cross-language portability and clarity.
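Roughly, that out-of-Kaitai merging step looks like the following (a sketch only, with hypothetical field names packets, pid and payload standing in for whatever a simple mpeg_ts.ksy would generate):

from collections import defaultdict

# Group payload bytes by program identifier (PID), preserving packet order;
# each resulting byte string is then written out and parsed by a second .ksy.
streams = defaultdict(bytearray)
for pkt in capture.packets:
    if pkt.payload is not None:
        streams[pkt.pid] += pkt.payload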

pavja2 avatar Mar 04 '19 11:03 pavja2

@pavja2 Could you demonstrate the merging algorithm for our reference here? It indeed looks like another valid use case for this feature.

GreyCat avatar Mar 09 '19 11:03 GreyCat

@pavja2 I was also about to implement an MPEG-TS parser; do you have your structure file shared somewhere?

kalidasya avatar May 06 '19 18:05 kalidasya

@kalidasya This person here actually claims that she has developed that, although I'm not sure if they'll be able to open source it.

GreyCat avatar May 06 '19 18:05 GreyCat

@kalidasya I do have a basic struct file. It's rough around the edges but works well enough for my needs. It'd be awesome if someone made it better! I'm traveling ATM but will post it as soon as I can and let you know when I do.

pavja2 avatar May 06 '19 19:05 pavja2

@kalidasya @pavja2 Guys, just a heads up: please consider creating a new issue in the formats repo for that format and moving the discussion there. Otherwise, it will be virtually impossible for others who might be interested to find these and join your cause :)

GreyCat avatar May 06 '19 19:05 GreyCat

@GreyCat thanks for the tip, issue created! @pavja2 when you have time can you link it to the referred ticket?

kalidasya avatar May 06 '19 21:05 kalidasya

To give some context for this ticket from the MPEG-TS point of view:

  1. every MPEG-TS stream consists of multiple 188-byte-long packets; each packet (among other things) has a packet identifier and payload data
  2. the proposed chain syntax should work for this use case, but we need to address the data as a dictionary:
.....
types:
  payload:
    chains:
      data: {}

  tspacket:
    seq:
      ....
      - id: payload_unit_start_indicator
        type: b1
      ...
      - id: pid
        type: b13
      ...
      - id: payload
        size-eos: true
        chain: data[pid]

@pavja2 what do you think? What complicates things in this case is that every TS packet has a payload_unit_start_indicator which flags whether this is the beginning of a new payload (this is the trigger point to start parsing the payload we have accumulated so far). My Kaitai knowledge is not enough to assess whether this is something that can be covered or not.
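In application code, that flag-driven accumulation would look roughly like this (a sketch only, reusing the hypothetical packets, pid, payload and payload_unit_start_indicator names from the snippets above):

# Flush the bytes accumulated for a PID whenever a packet announces the start
# of a new payload unit, then begin accumulating the next one.
assembled = {}  # pid -> list of complete payload units
pending = {}    # pid -> bytes accumulated so far
for pkt in capture.packets:
    if pkt.payload is None:
        continue
    if pkt.payload_unit_start_indicator and pending.get(pkt.pid):
        assembled.setdefault(pkt.pid, []).append(bytes(pending.pop(pkt.pid)))
    pending.setdefault(pkt.pid, bytearray()).extend(pkt.payload)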

kalidasya avatar May 09 '19 13:05 kalidasya

Moved from #555

In the ksy, is it possible to combine all value fields from data_chunk into a new _io stream? I need to combine (concatenate) the data into one stream for further processing, without the checksums.

types:
  data_chunk:
    seq:
      - id: value
        size: '(_io.size - _io.pos > 17) ? 16 : ((_io.size - _io.pos) - 2)'
      - id: checksum
        size: 2

Update:

Example data_chunks: (screenshot)

Each data_chunk has a field named "value" and a field named "checksum". What I want to do is to get one string containing the bytes from all the fields named "value", without the data from the "checksum" fields.

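Until something like this lands in KSY, it has to be done in application code; a minimal sketch in Python, assuming the generated class exposes the chunks as parsed.data_chunks:

import io
from kaitaistruct import KaitaiStream

# Concatenate only the `value` fields, dropping the 2-byte checksums, and
# wrap the result into a new stream for further parsing.
payload = b"".join(chunk.value for chunk in parsed.data_chunks)
sub_io = KaitaiStream(io.BytesIO(payload))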

jaroslaw-wieczorek avatar Jun 10 '19 08:06 jaroslaw-wieczorek

Hi hi! This would be handy for USB as well. For example, with USB mass storage over full-speed USB, packets carry 64 bytes at a time, but reads and writes of data are done in 512-byte chunks spread over multiple low-level (IN) packets. This is also done for things like long descriptor strings.

tannewt avatar Aug 06 '19 02:08 tannewt

Squashfs (https://github.com/kaitai-io/kaitai_struct_formats/pull/596) would also benefit from this. Metadata is stored in blocks that need to be processed individually and then concatenated before they can be parsed.

I have it working using custom functions, but native support would be great.

tisoft avatar Jul 26 '22 12:07 tisoft