nom
nom copied to clipboard
Size from incomplete streaming crlf parser
The parser for matching \r\n
has a streaming version. I would like to ask about the case where the input is exactly \r
. This case is covered in unit test, and checks for an incomplete error containing Needed::new(2)
.
https://github.com/Geal/nom/blob/e2a0dd52e07c6c55ab4ee7fad88e987a814a7a0f/src/character/streaming.rs#L1093
My understanding of Needed
is that it counts the number of extra bytes necessary to parse. And in this case it should be one. Is returning two instead of one deliberate?
https://github.com/Geal/nom/blob/e2a0dd52e07c6c55ab4ee7fad88e987a814a7a0f/src/character/streaming.rs#L149
I recognise that this one byte does not matter in the grand scheme of things. But for correctness, perhaps Needed::Unknown
is more appropriate? It is technically known. Checking the input length for zero or one should given an answer, but the input trait does not allow for this, I think. Also technically possible by comparing \r
, but seems over the top for the difference of one byte.
Meaning of "needed"
This is a tangentially related question. Consider an alt
of two streaming tag
parsers matching "hello"
and "help"
, acting on "hel"
. We may have a result in one or two bytes. What error should this return?
It currently returns one or two, depending on which tag
comes first in the alt
list. This feels ambiguous.
fn alt(input: &str) -> nom::IResult<&str, &str> {
nom::branch::alt((
nom::bytes::streaming::tag("help"),
nom::bytes::streaming::tag("hello"),
))(input)
}
assert_eq!(
alt("hel"),
Err(nom::Err::Incomplete(nom::Needed::new(1))),
// // If "hello" first.
// Err(nom::Err::Incomplete(nom::Needed::new(2))),
);
Sorry if this seems irrelevant. I am trying to understand what I can rely on from the needed
value, and what to return when writing custom streaming parsers.
Environment
- Rust version : rustc 1.53.0 (53cb7b09b 2021-06-17)
- nom version : 6.2.1, latest as of writing.
I am trying to understand what I can rely on from the needed value
If a parser returns some amount of needed bytes x
.
- If you add less than
x
bytes to the input the parser will most likely returnNeeded
again. - If you add
x
or more bytes to the input the parser may or may not returnNeeded
again.
There is no hard guarantee that lets you know for sure when you have added enough input (except calling the parser and having it return a result). But Needed
lets you know when you can skip calling the parser because you dont have enough input yet.
It currently returns one or two, depending on which tag comes first in the alt list. This feels ambiguous.
Its not really ambiguous if you keep the information above in mind. If alt returned the highest Needed
value from all of its subparsers then it might return too much.
To pick up your example, imagine the incomplete "hel" actually completes to "help". There is only 1 byte missing until the parser returns a result, but picking the largest Needed
value would result in 2. The alt parser can only know for sure that 2 bytes are needed if the first parser indicates that it can definitely not match the input, which is only possible if it has enough input to make sure thats correct.
In conclusion, there is only 1 byte Needed
before the parser may or may not return a different result.
Also in order to collect the highest Needed
value alt would have to continue executing every parser in its chain, collect the number of needed bytes for each and then return the maximum from that list. This would mean a massive hit to performance and is therefore not done.
That being said the crlf
parser should probably do an additional check on Incomplete to see if the first \r
character is present and then only return Needed::new(1)
.
If a parser returns some amount of needed bytes
x
.
- If you add less than
x
bytes to the input the parser will most likely returnNeeded
again.- If you add
x
or more bytes to the input the parser may or may not returnNeeded
again.
Thank you for clarifying. This is part of why I felt confused. If both cases boil down to "may or may not require more data", why not just treat every Needed
value the same as Needed::Unknown
?
There is no hard guarantee that lets you know for sure when you have added enough input (except calling the parser and having it return a result). But
Needed
lets you know when you can skip calling the parser because you dont have enough input yet.
I guess the point is, then, that reading the Needed
number of bytes does not guarantee completion, and is an advise for meaningful forward progress in parsing, for some definition of "meaningful".
To pick up your example, imagine the incomplete "hel" actually completes to "help". There is only 1 byte missing until the parser returns a result, but picking the largest
Needed
value would result in 2. The alt parser can only know for sure that 2 bytes are needed if the first parser indicates that it can definitely not match the input, which is only possible if it has enough input to make sure thats correct. In conclusion, there is only 1 byteNeeded
before the parser may or may not return a different result.
In this case, I suppose the meaningful progress after reading one more byte would be the confirmation or elimination of one tag.
Also in order to collect the highest
Needed
value alt would have to continue executing every parser in its chain, collect the number of needed bytes for each and then return the maximum from that list. This would mean a massive hit to performance and is therefore not done.
Indeed. Such an approach feels impractical, and the result not worth the effort. You have convinced me that Needed
should not be seen as an upper bound for a completion.
That being said the
crlf
parser should probably do an additional check on Incomplete to see if the first\r
character is present and then only returnNeeded::new(1)
.
This brings me back to the main question: Is Needed::new(2)
considered one of many acceptable return values because it is advisory, with Needed::new(1)
being preferred?
As for the implementation detail, would it make more sense to just check for \r
and then \n
separately?
line_ending()
could be write:
pub fn line_ending<I, E>(input: I) -> IResult<I, (Option<char>, char), E>
where
I: InputIter + Slice<RangeFrom<usize>>,
<I as InputIter>::Item: AsChar,
E: ParseError<I>,
{
pair(opt(char('\r')), char('\n'))(input)
}
I agree that this is a bug. Needed
should include the minimum number of bytes required to make progress, signaling to a streaming system to buffer at least that many bytes before retrying. It's easy to imagine this resulting in correctness issues: suppose you had a CRLF delimited document, that had been read up to the CR. The parser returns Needed(2), but the stream can only provide 1 more byte, which would mean the system would (incorrectly) report an error of an unexpected EOF.
Regarding your alt
+ Needed
question: This is something I just recently encountered as well. I'm not really sure I know what the best solution is; I've juggled with things like "if they all return Needed, return Needed with the lowest value". However, I think it's important that alt
always returns the first matching subparser, and designers using it come to rely on this behavior. I think it therefore makes sense that Needed
is propagated immediately, because the earlier parser could succeed, and therefore we should get more bytes to give it the chance before continuing with other branches.
The parser returns Needed(2), but the stream can only provide 1 more byte, which would mean the system would (incorrectly) report an error of an unexpected EOF.
That not how I see thing, I think one should read it as "at least 1", you try to read as much data as possible and try again, if you read at least one more octet you can retry.
Maybe:
- Change name variant of
Needed::Size
toHint
- Make alt always return max
Size(1)
orUnknown
cause it's impossible foralt
to know the subparser behavior
I agree that this is a bug.
Needed
should include the minimum number of bytes required to make progress [...]
I had wondered about that. But I could not think of an acceptable definition for "progress". For example, if progress is an opportunity to complete or error out, then for most parsers, the minimum would just be one, by virtue of them doing some sort of filtering. (On the other hand, I had only started on my Rusty and nommy journey a month ago, so may be I do not have enough experience to say that.)
[...] The parser returns Needed(2), but the stream can only provide 1 more byte, which would mean the system would (incorrectly) report an error of an unexpected EOF.
Hmm... I guess that might be a problem because... if remote closes, then the stream might have to wait for a timeout?
I think it therefore makes sense that
Needed
is propagated immediately, because the earlier parser could succeed, and therefore we should get more bytes to give it the chance before continuing with other branches.
Indeed. It does make sense to keep this behaviour. I was not troubled by this particular example myself. I merely used it as an illustration of the issue -- well-definedness of what Needed
means. And we already have different people interpreting them differently! Keeping track of a minimum Needed
value amongst alt
subparsers sounds like a good compromise. That might not be a "real" minimum though, if the input will fail every subparser in the next byte.
Change name variant of
Needed::Size
toHint
I suppose that makes sense from the user's point of view. If a Needed
value is returned, what do we expect them to do with it?
- If it is a hint, then "try to read that many bytes, and then parse again, as long as you managed to read something." In this case, then, what conditions do we have to not parse again? Specifically, how is the handling different from a
Needed::Unknown
? - If it is minimal, then may be "don't bother parsing again without reading that many bytes".
That not how I see thing, I think one should read it as "at least 1", you try to read as much data as possible and try again, if you read at least one more octet you can retry.
I think I agree? Definitely a buffered stream reader should be reading in large chunks. I only mean that Needed(2)
means that you need at least 2 more bytes, so a buffering system needs to read at least 2 more bytes before it retries the parser.
When I say "can only provide one more byte" I mean that the underlying reader (ie, a TCP stream) provides one more byte and then the connection closes. In that case, given a Needed(2)
, a parser adapter could correctly conclude that "the underlying parser requested at least 2 more bytes, but I was only able to get one more byte, so I can return an unexpected EoF without retrying the underlying parser"
Specifically, how is the handling different from a
Needed::Unknown
I agree that I haven't been able to think of a useful distinction here. I think that semantically there's a difference:
-
Needed
is for a parser that reads something of a specific length, such astag
, or a length-prefixed chunk -
Unknown
is more for things like "parse any number of alphanumerics, followed by a non alphanumeric".
However, in practice I don't know the difference between Unknown
and Needed(1)
; in both cases the behavior should be "load at least one more byte (but probably more) and retry. Potentially it could be good for error reporting.