kaitai_struct_python_runtime

Alternative runtime without `struct`

Open GreyCat opened this issue 7 years ago • 5 comments

Python's struct module seems to be pretty inefficient for our purposes. Every API it provides requires passing a format string to an unpack-like function, which parses that format string at runtime, calls the relevant unpack routines, and then constructs a tuple holding a single value, which we extract right away.
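To make the overhead described above concrete, here is a minimal sketch of a single-value read. The sample bytes and the "<I" format are illustrative assumptions, not taken from the runtime itself.

```python
import struct

# Hypothetical input: the 4-byte little-endian encoding of 42.
data = b"\x2a\x00\x00\x00"

# unpack() takes a format string and always returns a tuple,
# even when the format describes exactly one field.
result = struct.unpack("<I", data)

# The caller therefore has to index into the tuple right away.
value = result[0]

print(result, value)
```

The tuple is allocated on every call only to be discarded after the index operation.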

Actually, struct already has everything we need internally: for example, it contains functions that read ("unpack") integers, but they are not exposed as a Python API.

Would it make sense / be faster to introduce an alternative, native Kaitai Struct API, written in C, that would be faster than the existing one?

Cc @koczkatamas @KOLANICH @arekbulski

GreyCat, Mar 06 '18 11:03

I will reach out to the Python mailing list; the people there are very helpful with advice, and knowledgeable too.

There is a way to precompile a format string into a Struct object, but its unpack() method also returns a tuple. https://docs.python.org/3/library/struct.html#classes

>>> import timeit
>>> timeit.timeit("struct.unpack('=b', b'x')", "import struct")
0.18020300199714256
>>> timeit.timeit("p.unpack(b'x')", "import struct; p = struct.Struct('=b')")
0.11721895999653498
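The same comparison can be sketched as a standalone script. The function names and the iteration count are assumptions chosen for illustration, and the third variant (caching the bound method) is an extra idea that is not part of the measurement above. Timings depend on the machine and Python version, so none are quoted here.

```python
import struct
import timeit

DATA = b"x"
PRECOMPILED = struct.Struct("=b")   # format string parsed once, up front
UNPACK = PRECOMPILED.unpack         # bound method looked up once

def via_module():
    # Module-level function: the format string is passed on every call.
    return struct.unpack("=b", DATA)[0]

def via_struct_object():
    # Precompiled Struct object: still returns a tuple.
    return PRECOMPILED.unpack(DATA)[0]

def via_bound_method():
    # Same as above, minus the attribute lookup on each call.
    return UNPACK(DATA)[0]

for fn in (via_module, via_struct_object, via_bound_method):
    elapsed = timeit.timeit(fn, number=200_000)
    print(f"{fn.__name__}: {elapsed:.4f} s")
```

All three variants return the same value; only the per-call overhead differs.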

arekbulski, Mar 06 '18 12:03

> Would it make sense / be faster to introduce an alternative, native Kaitai Struct API, written in C

IMHO no.

1. C is fast, but I wonder if Rust may be better here.
2. If we make a better API, let it be part of Python, not a standalone library.
3. As @arekbulski has mentioned, it is possible to precompile the struct parsers. Of course, it is the job of KSC (the Kaitai Struct compiler) to merge adjacent fields into a single struct where possible. I wonder if it makes any sense to do some flattening.

meta:
  id: l_l_l
seq:
  - id: a
    type: u4
  - id: b
    type: u4
  - id: c
    type: aa
  - id: d
    type: f8
types:
  aa:
    seq:
      - id: b
        type: u1

Now (simplified, only to convey the idea):

class LLL(...):
  ...
    a = unpack("I", ...)[0]
    b = unpack("I", ...)[0]
    c = Aa(...)
    d = unpack("d", ...)[0]

With precompilation:

class LLL(...):
  ab_unp = Struct("II")
  d_unp = Struct("d")  # in fact, parsers for single values can be precompiled once and reused
  ...
    a, b = self.__class__.ab_unp.unpack(...)
    c = Aa(...)
    d, = self.__class__.d_unp.unpack(...)  # unpack() returns a 1-tuple

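With flattening:

class LLL(...):
  # A byte-order prefix ("<" or ">") is needed in practice: without one,
  # struct uses native alignment and inserts padding before "d".
  abcd_unp = Struct("IIBd")
  ...
    a, b, c_b, d = self.__class__.abcd_unp.unpack(...)
    c = Aa._from_unpacked_tuple((c_b,))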
KOLANICH, Mar 06 '18 20:03

I suggest closing this topic. I think we have already arrived at a conclusion: implementing the Python parser (not the runtime) in C would be a major undertaking that would not be worth the effort, and we have already exhausted what can be done in pure Python.

arekbulski, Apr 09 '18 00:04

If dropping Python 2 support is actually an option, then it might be worth looking at the from_bytes() class method of the built-in int type (available since Python 3.2). It basically works like this:

int.from_bytes(byte_string, byteorder=byteorder, signed=signed)

(signed defaults to False, i.e. unsigned)

for example:

>>> int.from_bytes(b'\x01\x01\x00\x00', byteorder='big')
16842752
>>> int.from_bytes(b'\x01\x01\x00\x00', byteorder='little')
257

It also converts byte strings of arbitrary length, so implementing something like a 6-byte unsigned integer (u6) becomes trivial:

>>> int.from_bytes(b'\x01\x01\x00\x00\x00\x00', byteorder='little')
257
>>> int.from_bytes(b'\x01\x01\x00\x00\x00\x00', byteorder='big')
1103806595072

https://docs.python.org/3/library/stdtypes.html#int.from_bytes
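As a rough sketch of how this could be wrapped for stream reading: the helper names read_uint and read_sint below are hypothetical and are not part of the existing runtime API.

```python
import io

def read_uint(stream, size, byteorder="little"):
    """Hypothetical helper: read `size` bytes as an unsigned integer."""
    buf = stream.read(size)
    if len(buf) != size:
        raise EOFError(f"requested {size} bytes, got {len(buf)}")
    return int.from_bytes(buf, byteorder=byteorder, signed=False)

def read_sint(stream, size, byteorder="little"):
    """Hypothetical helper: read `size` bytes as a signed integer."""
    buf = stream.read(size)
    if len(buf) != size:
        raise EOFError(f"requested {size} bytes, got {len(buf)}")
    return int.from_bytes(buf, byteorder=byteorder, signed=True)

# A 6-byte unsigned value followed by a 2-byte signed value.
stream = io.BytesIO(b"\x01\x01\x00\x00\x00\x00\xff\xff")
first = read_uint(stream, 6, "little")
second = read_sint(stream, 2, "big")
print(first, second)
```

No format string or tuple is involved, and the same helper covers every integer width.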

armijnhemel, Apr 13 '22 10:04

For arrays of numbers, we should probably use the array module.
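A minimal sketch of that idea for an array of 4-byte unsigned integers. The function name read_u4_array is hypothetical; the typecode lookup is there because the C types behind "I" and "L" have platform-dependent sizes.

```python
import array
import sys

def read_u4_array(buf, byteorder="little"):
    """Hypothetical helper: decode `buf` as a list of 4-byte unsigned ints."""
    # Pick an unsigned typecode that is exactly 4 bytes wide on this platform.
    typecode = next(tc for tc in "IL" if array.array(tc).itemsize == 4)
    values = array.array(typecode)
    # frombytes() interprets the data in native byte order and raises
    # ValueError if len(buf) is not a multiple of the item size.
    values.frombytes(buf)
    if byteorder != sys.byteorder:
        values.byteswap()  # swap in place when the data is not native-endian
    return values.tolist()

little = read_u4_array(b"\x01\x00\x00\x00\x02\x00\x00\x00", "little")
big = read_u4_array(b"\x00\x00\x00\x01", "big")
print(little, big)
```

The whole buffer is decoded in one call instead of one unpack per element.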

KOLANICH, Apr 13 '22 11:04