
New Combinator: discard_until / drop_until

Open Trequetrum opened this issue 2 years ago • 6 comments

This is a combinator I use all the time; it might be useful to see something like it in this crate.

It drops a byte at a time until the given parser matches, then returns the result.

I don't do parsing in any really performance-sensitive contexts, so this can probably be implemented better. This impl demonstrates the idea.

use nom::{bytes::complete::take, combinator::map, multi::many_till, IResult};

fn drop_until<'a, T>(
    parser: impl FnMut(&'a str) -> IResult<&'a str, T>,
) -> impl FnMut(&'a str) -> IResult<&'a str, T> {
    // Drop input one step at a time until `parser` matches, then keep only its output.
    map(many_till(take(1u8), parser), |(_, matched)| matched)
}

Trequetrum, Dec 22 '22

Isn't this equivalent to discarding the output of take_while?

let (s, _) = take_while(p)(s)?;

sshine, Jan 04 '23

Isn't this equivalent to discarding the output of take_while?

I don't fully understand how.


Let's say this is our input:

ahdHEahdkjbHELLOlkasjdLLadO

drop_until(tag("HELLO"))(input) 

returns:

Ok(("lkasjdLLadO", "HELLO"))

I suppose you could use

map(
    pair(
        take_while(not(tag("HELLO"))),
        tag("HELLO")
    ),
    |(_, v)| v
)(input)

but is that better? Maybe... though it seems like this matches HELLO twice because of the not(tag("HELLO")) parser.
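
(As an aside, a hypothetical way to spell that idea out so it type-checks: take_while expects a per-character predicate rather than a parser, so this sketch leans on many_till and peek instead, and it still ends up checking for HELLO twice at the boundary.)

use nom::{
    bytes::complete::tag,
    character::complete::anychar,
    combinator::{map, peek},
    multi::many_till,
    sequence::pair,
    IResult,
};

// Hypothetical: skip characters until "HELLO" is visible ahead, then match it again.
fn until_hello(input: &str) -> IResult<&str, &str> {
    map(
        pair(many_till(anychar, peek(tag("HELLO"))), tag("HELLO")),
        |(_, v)| v,
    )(input)
}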

Trequetrum, Jan 04 '23

I see that drop_until can more easily express some things.

Maybe defining your format in terms of the complement of something is more typical when parsing binary file formats, or when extracting something that is embedded within what is otherwise considered junk, e.g. codes inside Markdown, CSV, or such, entirely skipping the embedding format.

For specifying language grammars, it makes more sense to positively define the thing you're skipping (comments, whitespace, etc.) even if you're just going to discard it. It was with this frame of mind that I assessed the usefulness of drop_until.
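
(A tiny illustration of that style, assuming nom 7: the whitespace being skipped still gets its own positively defined parser and is then discarded.)

use nom::{
    character::complete::{alpha1, multispace0},
    sequence::preceded,
    IResult,
};

// The skipped thing (whitespace) is positively defined by `multispace0`,
// and `preceded` throws its output away.
fn word(input: &str) -> IResult<&str, &str> {
    preceded(multispace0, alpha1)(input)
}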

sshine, Jan 05 '23

Note that this somewhat parallels the conversation in #1223 / #1566 about how far nom should go in providing common parsers whose output people drop as needed vs. providing specialized no-output parsers. One specific case of interest is the multi module, where there are specialized non-allocating variants because the overhead of capturing the output there is a lot higher. Note that instead of providing O=() parser variants, they are _count variants, which keep a count rather than throwing the data away entirely, a common pattern in Rust APIs.
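
(For reference, a rough sketch of the kind of _count variant meant here, using many0 / many0_count from the multi module; the function names are just for illustration.)

use nom::{bytes::complete::tag, multi::{many0, many0_count}, IResult};

// Collects every match into a Vec, even if the caller immediately discards it.
fn abcs(input: &str) -> IResult<&str, Vec<&str>> {
    many0(tag("abc"))(input)
}

// Non-allocating variant: returns only how many times the parser matched.
fn abcs_count(input: &str) -> IResult<&str, usize> {
    many0_count(tag("abc"))(input)
}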

epage, Jan 05 '23

Note that this somewhat parallels the conversation in #1223 / #1566

Yeah, I can see that.

I think I'd address this less as an opportunity to tweak performance (it seems like if your parser isn't allocating, but just returning a slice of the input, there's no performance hit) and more as an appeal to providing a gentle introduction to Nom.

Many users come to nom as a means to replace Regex (fully or in part), as Regex can quickly become unmaintainable as complexity rises. Generally, regex was never a serious consideration for parsing language grammars, for example. Conceptually, regex is often used to match some embedded pattern of tokens in a larger context: a way to pull desired information from an otherwise noisy document.

Having a few combinators that are a 1-1 match to this domain makes the first tentative steps into Nom so much easier for those specific users to take. What I don't know is just how common this case really is. My intuition is that there are a lot of developers who are familiar with regex who may just want to toy with Nom for curiosity's sake.

If that's true, they're likely going to try...

[...] extracting something that is embedded within what is otherwise considered junk, e.g. codes inside Markdown, CSV, or such, entirely skipping the embedding format.
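
To make the regex comparison concrete, here is a hypothetical side-by-side, assuming the regex crate on one side and the drop_until sketch from the top of this issue on the other:

use nom::bytes::complete::tag;
use regex::Regex;

fn main() {
    // Regex style: scan the noisy document for the embedded token.
    let found = Regex::new("HELLO").unwrap().find("ahdHEahdkjbHELLOlkasjdLLadO");
    assert_eq!(found.map(|m| m.as_str()), Some("HELLO"));

    // Nom style: drop input until the same token parses, keeping the remainder.
    let (rest, hello) = drop_until(tag("HELLO"))("ahdHEahdkjbHELLOlkasjdLLadO").unwrap();
    assert_eq!((rest, hello), ("lkasjdLLadO", "HELLO"));
}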

Trequetrum, Jan 06 '23

more as an appeal to providing a gentle introduction to Nom.

With clap, a common problem I find is that the larger the API gets, the more likely people are to miss functionality they need. I feel like nom is on the cusp of that, and I would hope that nom limits the convenience variants of parsers to help new users.

Many users come to nom as a means to replace Regex (fully or in part), as Regex can quickly become unmaintainable as complexity rises. ... Having a few combinators that are a 1-1 match to this domain makes the first tentative steps into Nom so much easier for those specific users to take. What I don't know is just how common this case really is. My intuition is that there are a lot of developers who are familiar with regex who may just want to toy with Nom for curiosity's sake.

For some reason I don't see how this helps with aligning with regex. Maybe enumerating regex features and how you feel they line up with existing or potential parsers would help. That could also be a useful piece of documentation for nom.

epage, Jan 06 '23