kaitai_struct_python_runtime icon indicating copy to clipboard operation
kaitai_struct_python_runtime copied to clipboard

Overhaul of the architecture

Open KOLANICH opened this issue 7 years ago • 7 comments
trafficstars

started using memory-mapping files and added from_any method

KOLANICH avatar Mar 21 '18 17:03 KOLANICH

@GreyCat

KOLANICH avatar Mar 29 '18 19:03 KOLANICH

I have tried to refactor the architecture of the runtime to make it composable. It would require some modifications to KSC.

  1. KaitaiStruct is a class for use my a user
  2. _KaitaiStruct is a class for use by a compiler. It skips some checks related to entrance.
  3. KaitaiParser is a class to encapsulate parsing logic.
  4. IKaitaiDownStream is an interface describing low-level interacton to streams. It manages the underlying resource.
  5. KaitaiStream inherits methods and props from classes implementing IKaitaiDownStream via passthrough.
  6. KaitaiStream manages IKaitaiDownStream
  7. KaitaiStream is managed by a user explicitly using context managers. If a user needs just parsing, he should use from_any. If a user needs serialization, he has to manage lifetime of the stream explicitly.

KOLANICH avatar Jan 25 '20 19:01 KOLANICH

I'm coming a bit late to this PR, so maybe there's some context I'm missing. What's the motivation behind adding support for memory-mapped IO?

Mainly:

  • theoretical possibility to work with files larger than physical RAM.
  • serialization support.

All of it would require major changes in the code generated by the KSC and the runtime. But they are not part of this PR, let's do it step by step.

The plan is to store only the metadata in additional memory and do all the operations on the mapped memory directly. So when reading for example a u1 we just calc its offset in the map and read it from there. When writing we just write by the offset. The OS automatically swaps pages.

Of course it is suitable not for all use cases so a user should be able to choose which IO model he needs.

Do you have a particular use case where the stream-based IO operations are a performance bottleneck? Also, as far as I can tell, the runtime uses the memory-mapped file exactly like a stream-based file - does memory-mapped IO offer any actual performance benefits this way?

I haven't done any measurements about this in fact.

* What's the need for splitting `KaitaiStruct` into a hierarchy of three classes (`_KaitaiStruct_`, `_KaitaiStruct`, `KaitaiStruct`)? Is there any advantage to using the superclasses, which have less functionality than `KaitaiStruct`? Also, it's quite confusing to have three classes with identical names except for the number of underscores.
* What's the purpose of `KaitaiParser`? It appears to be completely unimplemented.

_KaitaiStruct doesn't have checks on openness, KaitaiStruct does. Both use composition for a parser. The idea is following:

  1. KSC implements KaitaiParser with parsing code that is currently in _read.
  2. KSC inherits either KaitaiStruct, setting its _parser to the generated parser and also creating props and __slots__. The parsing and serialization machinery goes to KaitaiParser-inherited class (we may need a better name), the storage and accessor machinery goes to _KaitaiStruct-inherited class.
  3. KSC uses _KaitaiStruct (via super()) where no openness checks are needed, such as parsing struct within seq fields.
  4. _KaitaiStruct_ is meant to be used by user-implemented opaque types when no machinery of _KaitaiStruct is needed.
* What's the need for `IKaitaiDownStream` and its implementations? All objects that it wraps (file objects, `io.BytesIO`, `mmap.mmap`) support normal Python stream operations, so there should be no need for a wrapper interface/ABC.

To support context management properly.

  • Python streams have a flaw because of compatibility to python 2 - they are entered when constructed. The wrappers enter them in __enter__ by delaying their construction till that moment, if that is needed. * It also provides a magic to detect wheter an object is __enter__ed.
  • And a magic to make sure it is __exit__ed when deleted.

Some general points about the code:

* There are some blatant errors in the code that make it impossible to import, let alone run correctly. I know you said that changes in the compiler would be needed to use this new version of the runtime, so I can understand that you haven't run a full test suite over this. But please do some basic testing before you submit this as a PR and ask for changes in the compiler. (Or if the PR is not yet finished, please mark it as WIP somehow.)
* The explanations from [#25 (comment)](https://github.com/kaitai-io/kaitai_struct_python_runtime/pull/25#issuecomment-578437146) should be included in the code as comments/docstrings in the relevant places. Future readers of the code shouldn't have to read a PR discussion to understand the architecture.

I currently consider this as a draft, it will likely be changed a lot. I have pushed it to just get your feedback about the proposed architecture.

* Many identifiers use dromedaryCase, which is not the usual Python naming convention.

Don't see any problem in fixing it.

* This PR breaks compatibility with Python 3.4 and older, including Python 2.7. (`typing` was added to the stdlib in 3.5, `pathlib` was added in 3.4, function type annotation syntax was added in 3.0).

It is officially EOL. Typing is added for clarity, this PR adds a lot of complexity, so it also need something for clarity in order not to get mad. I can remove the typing if 2.7 support is still needed. Or we can try to use the stuff like 3to2 for that.

* This PR includes type hints in some places, but not all. If we decide to use type hints, they should be applied consistently everywhere, but again I think that should be a separate PR.

Yes, I gonna use monkeytype for that. They can be added later.

KOLANICH avatar Jan 26 '20 10:01 KOLANICH

  • theoretical possibility to work with files larger than physical RAM.

I don't have a lot of experience with mmap in Python, but from what I've read, the data from the file is copied by default. This happens regardless of whether you use the bytearray-like interface (f[12:34]) or the IO-stream-like interface (f.read(123)). The reason for this is probably that the data is returned as a bytes object, which is supposed to be immutable, so the data has to be copied to ensure that it cannot change (which could happen if the mmap file is opened as writable and you write to a position that you previously read from).

To prevent this automatic copying behavior and save memory as you intended, I think you need to wrap the mmap file in a memoryview, which lets you slice the memory without copying it. Unfortunately memoryview doesn't support the usual Python IO stream methods (unlike mmap), so you'd probably have to write a wrapper class that adds an IO stream interface on top of memoryview. (io.BytesIO doesn't work here - you can pass it a memoryview object, but that copies its contents and doesn't reuse the memory.)

_KaitaiStruct doesn't have checks on openness, KaitaiStruct does.

  1. KSC uses _KaitaiStruct (via super()) where no openness checks are needed, such as parsing struct within seq fields.

Do you mean that if a type has only seq fields and no instances, it would not support __enter__/__exit__? I'm not sure if that's a good idea - this way there would be some types that require with to be used properly, and others that cannot use with.

  1. _KaitaiStruct_ is meant to be used by user-implemented opaque types when no machinery of _KaitaiStruct is needed.

I assume this would be optional? Because currently opaque types don't require any specific base class. Also, I'm not sure if the constructor signature is correct - as far as I can tell, opaque types only receive an _io argument, and never _parent and _root.

  1. KSC implements KaitaiParser with parsing code that is currently in _read.
  2. KSC inherits either KaitaiStruct, setting its _parser to the generated parser and also creating props and __slots__. The parsing and serialization machinery goes to KaitaiParser-inherited class (we may need a better name), the storage and accessor machinery goes to _KaitaiStruct-inherited class.

OK, but what's the reason for separating the parser from the struct? The two are still tightly coupled - each parser can only work with one struct type, and you cannot create a struct object without a parser.

  • Python streams have a flaw because of compatibility to python 2 - they are entered when constructed.

I don't think it's a compatibility thing... the open function just does exactly what the name says :)

The wrappers enter them in __enter__ by delaying their construction till that moment, if that is needed.

I don't think this behavior is a good idea. It's different from how every other IO stream in Python works. There's also no good way to use these classes without with, which is still necessary in some cases, and in the REPL it's just more convenient.

dgelessus avatar Jan 27 '20 01:01 dgelessus

Do you mean that if a type has only seq fields and no instances, it would not support enter/exit?

No, I mean either the top type manages the KaitaiStream or programmer-written code manages it. All the subtypes are meant to be used only within the top one. So a context of a KaitaiStream is meant to be already entered when using any of subtypes, so they don't need an own context manager.

others that cannot use with.

If we deal with a resource that have to be explicitly managed we must use either context management magics or their surrogates (i.e. open, close), the alternative is resource leak. If one doesn't want with he can call the magics explicitly and take care of exceptions himself.

There's also no good way to use these classes without with

It's a feature.

as far as I can tell, opaque types only receive an _io argument, and never _parent and _root.

IDK what to answer now, needs some testing.

OK, but what's the reason for separating the parser from the struct?

No strong reasons, just "composition over inheritance" mindset.

don't think it's a compatibility thing... the open function just does exactly what the name says :)

It is a compatibility thing - keeping open instead of replacing it with File.

KOLANICH avatar Jan 27 '20 08:01 KOLANICH

No, I mean either the top type manages the KaitaiStream or programmer-written code manages it. All the subtypes are meant to be used only within the top one. So a context of a KaitaiStream is meant to be already entered when using any of subtypes, so they don't need an own context manager.

OK, I understand what you mean, but I don't think we can differentiate between these two cases at the type level. A top-level type isn't always the type that manages the IO stream (for example if the type is imported and used by another spec), and the type that manages the IO stream isn't always the top-level type of a spec (you can directly use non-top-level types in your application code, e. g. the_spec.TheSpec.SomeHelperType.from_bytes(b"...")). You would need to handle this at the object level, possibly with an extra constructor parameter to indicate whether the KaitaiStruct in question "owns" the IO stream.

In general I think this would be a good idea though, as it would prevent users from accidentally withing or closeing a KaitaiStruct object that doesn't own the stream that it uses.

There's also no good way to use these classes without with

It's a feature.

I understand that you want to make people do proper resource management, but removing close is not the way to do that, because it makes it harder to do proper resource management in cases where with is not an option. It's much cleaner to write thing.close() than thing.__exit__(None, None, None).

It is a compatibility thing - keeping open instead of replacing it with File.

If the Python devs wanted to change that, they would have done so with the io module that was introduced with Python 3 (and then backported to Python 2.6 and 2.7), or with pathlib that was introduced in Python 3.4. But they didn't - in both cases they kept the behavior of the old open function.

In any case, Python's IO stream behavior is what it is - streams are fully initialized/opened/working when they are created, __enter__ does nothing, and both close and __exit__ close the stream forever. I think KS should follow that standard behavior, unless there's a good reason to implement things differently than every other Python library out there.

dgelessus avatar Jan 27 '20 21:01 dgelessus

A top-level type isn't always the type that manages the IO stream (for example if the type is imported and used by another spec)

Yes and no. KaitaiStruct checks if the stream is opened, if it is, it means that someone else manages it. If it isn't than the struct manages it.

and the type that manages the IO stream isn't always the top-level type of a spec (you can directly use non-top-level types in your application code, e. g. the_spec.TheSpec.SomeHelperType.from_bytes(b"..."))

Yes. That's why the compiler should inherit everythjng from KaitaiStruct but use the internal stuff via super() (not sure if it is possible to construct it this way) in order not to call full KaitaiStruct.__init__. Another concern here is that super() may have more overhead than the overhead related to full KaitaiStruct. It has to be measured, I cannot even estimate the chances.

but removing close is not the way to do that, because it makes it harder to do proper resource management in cases where with is not an option. It's much cleaner to write thing.close() than thing.exit(None, None, None).

IMHO, magics cleaner, they make it explicit that we are dealing with a context manager and that we are doing something needing care. The Nones can be defaulted.

I think KS should follow that standard behavior, unless there's a good reason to implement things differently than every other Python library out there.

The whole point of context managers is to get rid of manual freeing. If we make the type to __enter__ on creation, this would mean that anuser would have to do it himself.

BTW: should we allow the type to be __entered__ after it has been __exitted__? IMHO we should.

KOLANICH avatar Jan 27 '20 23:01 KOLANICH