nom
nom copied to clipboard
length_value doesn't work with bits
The multi function length_value
doesn't work when parsing bits. It's clearly not meant to given the traits on the input, but since length_count
does it seems that length_value
should as well. I ran into this issue when attempting to parse a binary format that uses specific, non-byte lengths of bits for various fields, one set of which is a binary number indicating how many more bits to consume and parse (was hoping to compose length_value
with fold_many1
). I ended up writing a while-loop that tracks the consumed bits while it parses but would much prefer to keep it in idiomatic nom.
Prerequisites
- Rust version: rustc 1.56.1 (59eed8a2a 2021-11-01)
- nom version: "7.1.0"
- nom compilation features used: alloc, std
Test case
extern crate nom;
use nom::{
bits,
bytes::complete::{tag, take},
combinator::{map, map_res},
multi::{length_count, length_value},
IResult,
};
fn do_the_thing(i: &str) -> IResult<&str, &str> {
length_value(
map_res(take(2usize), |u: &str| u.parse::<usize>()),
tag("123"),
)(i)
}
fn do_the_thing_with_bits(i: (&[u8], usize)) -> IResult<(&[u8], usize), u64> {
length_value(
map(bits::complete::take(53usize), |u: u64| u),
bits::complete::tag(1u64, 1usize),
)(i)
}
fn do_a_similar_thing(i: (&[u8], usize)) -> IResult<(&[u8], usize), Vec<u64>> {
length_count(
map(bits::complete::take(11usize), |n: u16| n),
bits::complete::tag(1, 1usize),
)(i)
}
fn main() {
let input = "03123";
let result: IResult<_, _> = do_the_thing(input);
let input = vec![0x03];
let result: IResult<_, _> = bits(do_the_thing_with_bits)(&input);
}
The first and third functions compile as expected, but the second function doesn't because the trait bound (&[u8], usize): nom::InputTake is not satisfied
. The example is a bit contrived for simplicity, but it seems to me that length_value
should work if length_count
works.
Advent of code eh? :)
Hit the same issue myself https://github.com/Geal/nom/issues/1477
I think there's a subtle difference between length_count
and length_value
that makes it a little trickier.
length_count
has a pretty easy job: all it needs to do is get a number n
from the first parser, then run the second parser n
times. length_value
needs to do the following:
- get the length,
n
, from the first parser - take a prefix slice of length
n
from the input - run the second parser on that slice
The difficulty lies in that second step. For byte slices this is super easy, it just takes a slice. Bit slices are trickier though, and it's because of how nom implements them. In nom a bit slice is represented as a (&[u8], usize)
, where the usize
is an index indicating the starting bit in the first byte. How do we take a prefix slice of this? Let's say I have the single-byte input (&[0b1010_1010], 0)
, and I tell nom I want only the first 4 bits. nom has a concept of a start bit in bit slices, but no concept of an end bit, so there's no way we can represent this case using a bit slice. Note that the take method for bits returns an integer rather than a bit slice, presumably in part for this reason.
Not sure what the best way around this is. Changing the format of bit slices would be intrusive and would only be useful for a certain class of parsers (like this one). Another possibility is to have something like
/// Grabs the first `count` bits of the input and copies them into a new bit slice
fn copying_take(count: usize) -> impl FnMut((&[u8], usize)) -> IResult<(&[u8], usize), IResult<(Vec<u8>, usize), E> {
....
}
It needs to be a Vec
because of the allocation, so the ergonomics of this already clunky function probably aren't very good.
My workaround in advent of code looked something like this (but a little dumber), I assume yours was similar:
- read the length
n
from the first parser - verify that there are at least
n
bits available - give the second parser the smallest valid prefix slice of the input with at least n bits
- run the parser on that slice
- verify that the parser consumed exactly
n
bits
This worked in Advent of Code because the parser was never going to consume more than the requested number of bits, but you could very conceivably have a parser that would be happy to consume n
, or n+1
, or n+2
... bits, so I'm not sure this solution generalizes.
Ha! I looked for an open issue, found none, started writing my issue, then had to work, finished up my issue, and posted it. Shoulda checked again before posting I guess!
I see what you mean now, though I'm tracking the remaining number of bits left to consume, so detecting over-consumption is possible, then bubbling up an error from there would right the course: you told nom to parse 100 bits and it parsed 110, so the parse failed. take
would indeed need the concept of an end to return more than an integer, but I think length_value
could manage to track that since it can know what how many bits it had at the start and how many it would be returning. That said, I'm not an expert in nom or even Rust so there is likely some nuance. I suspect length_value
will just need a special bit implementation if it's possible at all.
Here's what I think the annoying nuance is:
Let's say I have some greedy parser, like many1(take(1usize))
, and let's say I have an implementation of length_value
that simply errors on overconsumption. Then let's say I do this (example is pretty stupid, but it's just for illustration):
fn parse(input: (&[u8], usize)) -> ... {
length_value(take(8usize), many1(take(1usize)))
}
fn test {
let res = parse((&[0x04, 0b1010_1111], 0));
...
}
This should logically succeed: length_value
reads the first 8 bits and gets n=4
, then should be able to run many1(take(1usize))
on the next 4 bits and get a successful result back. But in the implementation where we don't point to the end of the bit slice, we will hand 8 bits to the parser and it'll consume all 8 of them. Then length_value
will see that the parser overconsumed, and fail. This seems like really confusing and frustrating behaviour for length_value
to have. May be that there's a fix for this that I'm missing though.
Ah yeah I see it now. Maybe there's a way around it with copying bits, then consuming an equal number if it parses right? Might not be great for large numbers.
Thanks for your insight though! I hope there's a way around it - just to clean up my parser for a yearly game.
Yeah I think your suggestion works, just comes with the disadvantage—like you said—of being forced to do a copy, but it's also limited in that your output can't contain any references to the input, since that copy won't live very long. That being said, I'm guessing that references to the input aren't as common a pattern with bit parsers as they are with string parsers, so maybe that part isn't such a big deal.
use nom::{bits::complete::take, combinator::verify, multi::fold_many_m_n, IResult};
fn do_the_thing_with_bits(i: (&[u8], usize)) -> IResult<(&[u8], usize), u64> {
fold_many_m_n(
53,
53,
verify(take(1usize), |&o: &u8| o == 1), // could be replace by tag but verify is more general
|| (1 << 53) - 1, // precompute value for your use case
|acc, _: u8| acc,
)(i)
}