binary icon indicating copy to clipboard operation
binary copied to clipboard

Proper lazy parsing

Open BurningWitness opened this issue 1 year ago • 0 comments

Motivation

It seems like there are no proper lazy byte parsers in the ecosystem and binary is already known to be both bad performance-wise (https://github.com/kolmodin/binary/pull/194) and hacky implementation-wise (https://github.com/kolmodin/binary/issues/136#issuecomment-307424037).

Implementation

The rules are as follows:

  • The current chunk is always stored in full. The position within the chunk is tracked separately (ChunkOffset);

  • All known further chunks are stored separately in a lazy ByteString;

  • Rolling back is implemented as a snoc-list of further chunks (Rollback), where track needs to be kept of whether new chunks need to be stored on the rollback list (Policy) and whether a rollback can occur (Location).

    These two flags ensure the following behaviors:

    • When a new known further chunk is accessed, it will be stored on the rollback list if needed;

    • When the parser runs out of input and the user provides more, all of the new input will be stored on the rollback list if needed, alongside turning off the previous behavior (since we don't want to double-store chunks when nesting alternatives);

  • Rolling back only requires appending the rollback list to the remaining list of known further chunks. The current chunk remains the same.

Format deviations

  • Decoder now has a different shape, both accepting and returning lazy ByteStrings;

  • All reads now consistently fail at end-of-input (previously the behavior was inconsistent inside isolate);

  • isolate 0 (isolate 1 (pure ())) now fails with not enough bytes instead of ... less than the expected ....

Functions that could be added

  • Copying variants of functions that yield lazy ByteStrings, reallocating the first and last chunks of the result if they're partial.

  • match :: Get a -> Get (a, LazyByteString), where LazyByteString is referencing the underlying chunks.


Things that need resolving:

  • Performance. I do not know the optimization pipeline and thus cannot speculate what exactly is causing the 1.5x bottleneck on parsing Word8s repeatedly. Also the inlining is most probably over-eager.

  • The benchmarks don't tell a good story (sure, it is slower than the previous version, but it's also faster than cereal on brackets and than attoparsec on Struct4?)

  • isolate is currently eager input-wise. This is a format deviation that can be resolved by adding a third branch to it.

    I'm against isolate being in the API because it does not seem to be needed in the real-world and including it makes the parser both more complicated and slower (entering the next chunk may need a trim);

  • The API can be revamped (see the parse and initiate runners, along with a lot of unsafe functions), however I don't know what the preferred way of doing that is, so I kept the external changes to a bare minimum.

BurningWitness avatar Jan 12 '24 17:01 BurningWitness