chumsky

`binary` mod to match `text`

Open CraftSpider opened this issue 2 years ago • 3 comments

Proposal

A binary module to match text, with similar helpers but for non-textual input.

TL;DR Explanation

The binary module would have functions for parsers like int and string, which would interpret the relevant types from u8 streams.

Full Explanation

The proposed module would contain helpers for parsing binary files, in the same vein as the existing helpers for parsing text.

Some helpers that would be nice to have (exact signatures up for bikeshedding, main point is what they'd allow):

  • int<I>(endian: Endian) for reading size_of::<I>() bytes as an integer of a particular type and endian
  • float<F>(endian: Endian) same for floats. Maybe combine these
  • string(ty: StringTy) for reading strings. The type would be something like 'null terminated' or 'length prefixed'
  • Possibly other types as well, such as Vec, Arrays, or similar.
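As a rough illustration of what such helpers would do, here is a minimal sketch in plain Rust over byte slices. It does not use chumsky's API. `Endian` and `StringTy` follow the names in the list above, while `read_u32`, `read_string`, and the single-byte length prefix are assumptions made purely for this example.

```rust
/// Byte order for multi-byte integers (hypothetical type, named after the
/// `Endian` argument in the proposal).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Endian {
    Little,
    Big,
}

/// String layouts mentioned in the proposal (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum StringTy {
    NullTerminated,
    LengthPrefixed,
}

/// Read a `u32` from the front of `input`, returning the value and the
/// remaining bytes, or `None` if fewer than 4 bytes are available.
pub fn read_u32(input: &[u8], endian: Endian) -> Option<(u32, &[u8])> {
    if input.len() < 4 {
        return None;
    }
    let (head, rest) = input.split_at(4);
    let mut buf = [0u8; 4];
    buf.copy_from_slice(head);
    let value = match endian {
        Endian::Little => u32::from_le_bytes(buf),
        Endian::Big => u32::from_be_bytes(buf),
    };
    Some((value, rest))
}

/// Read a UTF-8 string from the front of `input`.
/// Assumption: a length-prefixed string stores its length in one `u8`.
pub fn read_string(input: &[u8], ty: StringTy) -> Option<(String, &[u8])> {
    let (bytes, rest) = match ty {
        StringTy::NullTerminated => {
            // Everything up to the first 0 byte; the terminator is consumed.
            let end = input.iter().position(|&b| b == 0)?;
            (&input[..end], &input[end + 1..])
        }
        StringTy::LengthPrefixed => {
            let (&len, tail) = input.split_first()?;
            let len = len as usize;
            if tail.len() < len {
                return None;
            }
            tail.split_at(len)
        }
    };
    let s = std::str::from_utf8(bytes).ok()?;
    Some((s.to_owned(), rest))
}
```

Each function returns the unconsumed tail, so calls can be chained by hand. A combinator library would hide that threading and attach richer errors than `None`.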

I think chumsky has good potential as a tool for parsing non-textual files as well as text-based ones, but these kinds of primitive operations are currently missing. With just a handful of these basic tools, parsing binary files could be just as painless as any other type.

CraftSpider, Dec 08 '21 03:12

While I'm not outright opposed, I'm a little reluctant to implement these for a few reasons:

  • Chumsky is not intended for high-performance parsing; it's designed for high-quality error messages. This makes it less useful in the domain of binary parsing given that parsers like nom exist. For most cases, I'm sure that nom's binary parsing is likely several times faster than Chumsky's, at least.

  • The potential scope of this seems unbounded and is probably better covered by dedicated binary parsing crates like bincode, bson, etc.

I think it's fair to say that Chumsky is pretty specifically oriented towards the parsing of text for the purpose of generating human-readable error messages, so I'm worried that this falls outside the scope of the library and might become quite difficult to maintain.

zesterer, Dec 08 '21 14:12

That's fair. I was honestly thinking that performance still wasn't the primary goal. I tend to write binary parsers for reverse engineering purposes, a situation where chumsky's focus on errors and ease of iteration over speed seems perfect. Often the trickiest part of parsing an unknown format is figuring out how your parser is wrong, and good errors can make that significantly easier.
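To make the point about errors concrete, here is a small sketch of the kind of diagnostic that helps when a guess about an unknown format turns out to be wrong. It is plain Rust, not chumsky's error machinery. `BinError` and `expect_magic` are hypothetical names invented for this example.

```rust
use std::fmt;

/// A hypothetical binary-parsing error that records where the mismatch
/// happened and which byte (if any) was found there.
#[derive(Debug, PartialEq)]
pub struct BinError {
    pub offset: usize,
    pub expected: &'static str,
    pub found: Option<u8>,
}

impl fmt::Display for BinError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.found {
            Some(b) => write!(
                f,
                "at offset {:#x}: expected {}, found byte {:#04x}",
                self.offset, self.expected, b
            ),
            None => write!(
                f,
                "at offset {:#x}: expected {}, found end of input",
                self.offset, self.expected
            ),
        }
    }
}

/// Check that `input` starts with `magic`. On success, return the number of
/// bytes consumed; on failure, report the first offset that did not match.
pub fn expect_magic(input: &[u8], magic: &[u8]) -> Result<usize, BinError> {
    for (i, &want) in magic.iter().enumerate() {
        match input.get(i) {
            Some(&got) if got == want => {}
            other => {
                return Err(BinError {
                    offset: i,
                    expected: "magic bytes",
                    found: other.copied(),
                })
            }
        }
    }
    Ok(magic.len())
}
```

Reporting the exact offset and offending byte, rather than a bare failure, is what lets you line the error up against a hex dump of the file.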

Scope could be an issue, though I think it would be fine to pick a set of 'basic' operations, and anything that starts asking thorny implementation questions could be pushed to an external extension. All of it could be an extension, but I feel just accepting a limited set of basic operations is a sufficiently minimal scope for sufficiently high usefulness.

I understand if in the end you don't want to do this, but I think it would at least be useful for some cases.

CraftSpider, Dec 08 '21 15:12

This could be a useful thing to have in an extension crate, particularly when I add support for zero-copy parsing (see #9).

zesterer, Dec 09 '21 17:12