kaitai_struct icon indicating copy to clipboard operation
kaitai_struct copied to clipboard

Data views / substreams support

Open GreyCat opened this issue 7 years ago • 16 comments

Currently, all languages use something similar to this code, when it's time to do a substream and parse objects from substream:

seq:
  - id: foo
    size: some_size
    type: foo_class
this._raw_foo = io.readBytes(someSize());
KaitaiStream _io__raw_foo = new KaitaiStream(_raw_foo);
this.foo = new FooClass(_io__raw_foo, this, _root);

This is inefficient, especially for larger data streams - it needs to load everything into memory and then parse it from there. A more efficient approach would use using some sort of substreams in manner of data views, i.e. something like:

KaitaiStream _io__foo = _io.substream(_io.pos(), someSize());
this.foo = new FooClass(_io__foo, this, _root);

or, for instances that have known pos field, something like that:

instances:
  foo:
    pos: some_pos
    size: some_size
    type: foo_class
KaitaiStream _io__foo = _io.substream(somePos(), someSize());
this.foo = new FooClass(_io__foo, this, _root);

The devil, of course, is in the details:

  1. We need to make sure a substream functionality for every target language exists (or implement it there)
  2. Probably at least for some time we'll have to support two different ways: i.e. parsing via reading _raw_* byte arrays and parsing using substreams.
  3. We need to decide what to do in lots of other cases:
  • What do we return when reading a type-less value: i.e. user would expect a byte array or we'll switch to use substream there too? or a substream-that-can-be-used-as-bytearray? or it should be used-choosable?
  • What do to when repeat constructs exist on this field?
  • How would that mix with process - these actually require reading and re-processing the whole byte array in memory. Or should we re-implement them as stream-in => stream-out converters as well?

GreyCat avatar Nov 07 '16 14:11 GreyCat

Construct had a substream class that was used by Prefixed field. I am not familiar enough with Kaitai to provide an implementation, just posting it for reference. https://github.com/construct/construct/blob/a06d8dcf7dfe9839d340c1cef723c9b398efb17d/construct/lib/bitstream.py#L139

arekbulski avatar Jan 17 '18 02:01 arekbulski

@arekbulski Yes, that's exactly what we were looking to here :)

GreyCat avatar Jan 17 '18 10:01 GreyCat

On some runtimes (those non-statically typed) the feature might actually be counter productive. Especially if field inside has only few bytes. Remember that, on Python at least, you are replacing C-impl bytes copy and C-impl BytesIO with something where each read is wrapped in pure-python red tape. And you do that to avoid C-imple copying of few bytes.

If the subcon is mega-size, well, that changes things. You might consider adding a property to parsed field, so at compile time it emits one code or the other. User would need to opt-in for this feature.

arekbulski avatar Jan 23 '18 11:01 arekbulski

A valid concern, but I'd suggest that we benchmark these things first and then see if we need to do any workarounds. May be in languages like that implementation of .substream(...) should decide in runtime if it's worth to just do a native BytesIO + copy (if that's below a certain threshold of bytes), or wrap it in our wrapper.

Also, I'd suggest that we'd start investigating if there are any kind of bound substream class implementations available in other languages.

GreyCat avatar Jan 23 '18 11:01 GreyCat

Agreed, we need a benchmark.

arekbulski avatar Jan 23 '18 11:01 arekbulski

Hi,

Maybe this is a separate issue but I was wondering if it is possible to extend this to custom processors?

The input to the processor would be the view as described in the issue. The output a higher level object that exposes the processed content as a new kaitai stream/view with the addition of some state management methods like close() for example when the root object is destroyed. The resulting view can then again be used with the data view as described in this issue for the parsing of the processed content.

In my case I have a filesystem with potentially large encrypted partitions. What I would like to do is decrypt the partition to disk en provide a mmap'ed file to kaitai for further parsing. But I can imagine cases where the view returned by the processor would do the processing on demand (like simple xor that is already available).

I think my case can be solved by using a custom type that will do the decryption and manually call kaitai to parse its content afterwards. State management of the mmaped file would have to be done out of band.

jvisser avatar Sep 27 '18 18:09 jvisser

@jvisser Sure, it makes sense. We just need to think out the API carefully.

GreyCat avatar Sep 27 '18 19:09 GreyCat

Hey folks, please take a look, I took a first pass at implementing this for Ruby only:

  • https://github.com/kaitai-io/kaitai_struct_ruby_runtime/pull/9 is a Ruby runtime change.
  • https://github.com/kaitai-io/kaitai_struct_compiler/commit/57d8dc2adf2d3cb8b519cb29d7b15a9de738ee9b implements changes in RubyCompiler to accommodate that new runtime usage.

Primarily, changes in generated code will look like this:

   def _read
     @len1 = @_io.read_u4le
-    @_raw_block1 = @_io.read_bytes(len1)
-    _io__raw_block1 = Kaitai::Struct::Stream.new(@_raw_block1)
-    @block1 = Block.new(_io__raw_block1, self, @_root)
+    _io_block1 = @_io.substream(len1)
+    @block1 = Block.new(_io_block1, self, @_root)

So, 2 lines instead of 3, no pre-reading of bytes in the memory, cleaner API overall. I will launch the official tests now, but judging from the tests locally, tests still work as they were before.

Please tell me if you think it's a good idea and/or if we can make this work for other languages.

GreyCat avatar Jul 26 '22 00:07 GreyCat

Added similar implementation to Java — see https://github.com/kaitai-io/kaitai_struct_java_runtime/commit/ee61d73c92e855d43338ded4d0445b3b26977935 for runtime change and https://github.com/kaitai-io/kaitai_struct_compiler/commit/b529bc6b for compiler. The main difference is Java has two KaitaiStream implementations, and the good news is that one of them (ByteBuffer-based one) already has all the slicing/limits machinery in place, ready for substreams implementation.

Generated code-wise, it's a very similar change:

     private void _read() {
         this.len1 = this._io.readU4le();
-        this._raw_block1 = this._io.readBytes(len1());
-        KaitaiStream _io__raw_block1 = new ByteBufferKaitaiStream(_raw_block1);
-        this.block1 = new Block(_io__raw_block1, this, _root);
+        KaitaiStream _io_block1 = this._io.substream(len1())
+        this.block1 = new Block(_io_block1, this, _root);

The big caveat is that this._raw_block1 will stay null. Shall we drop support for it, or shall we strive to keep compatibility? E.g. we can technically replace this:

    public byte[] _raw_block1() { return _raw_block1; }

with something like

    public byte[] _raw_block1() {
        KaitaiStream io = block1._io();
        long oldPos = io.pos();
        io.seek(0);
        byte[] allBytes = block1.readBytesFull();
        io.seek(oldPos);
        return allBytes;
    }

GreyCat avatar Jul 27 '22 23:07 GreyCat

Please, consider also merging https://github.com/kaitai-io/kaitai_struct_java_runtime/pull/28 and related in order to give tools a chance to correctly visualize substreams.

Mingun avatar Jul 28 '22 04:07 Mingun

@GreyCat Note that the adoption of SubIO broke the test ExprIoPos in Ruby (test_out/ruby/ci.json:384-394):

   "ExprIoPos": {
-    "status": "passed",
-    "elapsed": 0.000336907,
+    "status": "failed",
+    "elapsed": 0.000463687,
+    "failure": {
+      "file_name": "./spec/ruby/expr_io_pos_spec.rb",
+      "line_num": 4,
+      "message": "undefined method `size' for #<Kaitai::Struct::SubIO:0x0000000002666050 @parent_io=#<File:src/expr_io_pos.bin>, @parent_start=0, @parent_len=16, @parent_end=16, @pos=10, @closed=false>",
+      "trace": "/home/travis/build/kaitai-io/ci_targets/runtime/ruby/lib/kaitai/struct/struct.rb:145:in `size'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:30:in `_read'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:25:in `initialize'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:17:in `new'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:17:in `_read'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:12:in `initialize'\n
+      /home/travis/build/kaitai-io/ci_targets/runtime/ruby/lib/kaitai/struct/struct.rb:26:in `new'\n
+      /home/travis/build/kaitai-io/ci_targets/runtime/ruby/lib/kaitai/struct/struct.rb:26:in `from_file'\n
+      /home/travis/build/kaitai-io/ci_targets/tests/spec/ruby/expr_io_pos_spec.rb:6:in `block (2 levels) in <top (required)>'\n
+      /home/travis/.rvm/gems/ruby-3.0.4/gems/rspec-core-3.11.0/lib/rspec/..."
+    },
     "is_kst": true
   },

generalmimon avatar Jul 29 '22 08:07 generalmimon

Related: https://github.com/kaitai-io/kaitai_struct_python_runtime/issues/67

armijnhemel avatar Jul 29 '22 10:07 armijnhemel

@generalmimon

Note that the adoption of SubIO broke the test ExprIoPos in Ruby (test_out/ruby/ci.json:384-394):

Great catch, thanks! It's very easy to fix, actually, let me do that.

GreyCat avatar Jul 30 '22 08:07 GreyCat

Zero-copy substreams are great, but we should consider that they would typically require a seekable stream where seek(), pos() and size() methods are available (unless there is an ingenious way around it without having to call any of them).

So at least there should be a command-line option to bring the traditional _raw_* = read_bytes(n) + _io__raw_* = new KaitaiStream(_raw_*) approach back, otherwise support for non-seekable streams will be lost.

It's basically the same issue as in https://github.com/kaitai-io/kaitai_struct_cpp_stl_runtime/pull/46#discussion_r730420039.

generalmimon avatar Jul 30 '22 15:07 generalmimon

Makes perfect sense. We'll be supporting both anyway, e.g. for sake of supporting existing interface for custom processing which is all working exclusively on byte arrays. Let's start with a command line switch to toggle both ways.

GreyCat avatar Jul 31 '22 17:07 GreyCat

Zero-copy substreams are great, but we should consider that they would typically require a seekable stream where seek(), pos() and size() methods are available (unless there is an ingenious way around it without having to call any of them).

So at least there should be a command-line option to bring the traditional _raw_* = read_bytes(n) + _io__raw_* = new KaitaiStream(_raw_*) approach back, otherwise support for non-seekable streams will be lost.

It's basically the same issue as in kaitai-io/kaitai_struct_cpp_stl_runtime#46 (comment).

but as soon as you have read bytes, then you can just create a seekable stream that you can pass around, right?

armijnhemel avatar Jul 31 '22 18:07 armijnhemel