nom
Feature Request: re_take_until!
I am trying to parse a not-so-structured document, and this would be a nice feature to have, so that I don't have to directly rely on regex
package (and for better code readability).
If you are interested, I can do a PR since it seems fairly straightforward.
hello, could you tell me more about what that combinator would do?
Example: re_take_until!("hello|world") will take_until that regular expression is matched.
Applying the above on "It's not the end of the world!" should return (remaining input: "world!", output: "It's not the end of the ").
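To make the requested behaviour concrete, here is a minimal, dependency-free sketch. It is not nom's API: take_until_any is a hypothetical name, and the regex alternation "hello|world" is modelled as a list of literal strings searched with the standard library's str::find, so it only covers the literal-alternatives case from the example.

```rust
/// Hypothetical helper (not part of nom): splits `input` at the earliest
/// occurrence of any string in `alternatives`.
/// Returns `(remaining, output)` in nom's tuple order, or `None` when
/// none of the alternatives occurs in `input`.
fn take_until_any<'a>(input: &'a str, alternatives: &[&str]) -> Option<(&'a str, &'a str)> {
    // Earliest byte offset at which any alternative starts.
    let index = alternatives
        .iter()
        .filter_map(|alt| input.find(*alt))
        .min()?;
    // `find` returns an offset on a char boundary, so `split_at` cannot panic.
    let (output, remaining) = input.split_at(index);
    Some((remaining, output))
}
```

Calling take_until_any("It's not the end of the world!", &["hello", "world"]) yields the (remaining, output) pair described above, with the trailing space kept in the output.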
Honestly, a regex-based combinator would be absolutely amazing to have.
I don't doubt for one moment that nom can do everything regex can do, but there's just something appealing about the succinctness of being able to write something like "[^@]+@[^@\.]+\.\w+" as a rudimentary email address parser.
If you're then able to throw that into the greater nom ecosystem, that would be splendid.
I don't doubt for one moment that nom can do everything regex can do
Without diving into the academic way of looking at this statement, I don't think there is a nom equivalent of this particular proposal. The take_* macros only do T -> bool or &[T], not other whole parsers.
there's a lot of regex based combinators, you can find them by looking for the prefix re_ on https://docs.rs/nom/4.2.3/nom/
Not sure if replying to me, but to clarify, this doesn't exist (yet), so I just wrote it myself:
// `take_till_match!(alt!(tag!("John") | tag!("Amanda")))`
// Running that on `"Hello, Amanda"` gives `Ok(("Amanda", "Hello, "))`
macro_rules! take_till_match(
(__impl $i:expr, $submac2:ident!( $($args2:tt)* )) => (
{
use $crate::lib::std::result::Result::*;
use $crate::lib::std::option::Option::*;
// TODO: replace nom with $crate
use nom::{Err, Needed,need_more_err, ErrorKind};
use nom::InputLength;
use nom::FindSubstring;
use nom::InputTake;
use nom::Slice;
let ret;
let input = $i;
let mut index = 0;
loop {
let slice = input.slice(index..); // XXX: this is bad with multi-byte unicode
match $submac2!(slice, $($args2)*) {
Ok((_i, _o)) => {
ret = Ok(input.take_split(index));
break;
},
Err(_e1) => {
if index >= input.len() {
// XXX: this error is dramatically wrong
ret = need_more_err(input, Needed::Size(0), ErrorKind::TakeUntil::<u32>);
break;
} else {
index += 1;
}
},
}
}
ret
}
);
($i:expr, $submac2:ident!( $($args2:tt)* )) => (
take_till_match!(__impl $i, $submac2!($($args2)*));
);
($i:expr, $g:expr) => (
take_till_match!(__impl $i, call!($g));
);
);
I took @cormacrelf's macro and made some changes.
First, I added a trait to allow "safe-slicing" of strings.
Secondly, I modified the macro to make use of the trait.
@lawliet89 that's closer, but you could reuse existing APIs by making the trait give you an Iterator &str::char_indices().map(|(i, _)| i) and create an index++ version for byte slices. Here's what I ended up using in my code:
{
let input = $i;
for index in input.char_indices().map(|(i, _)| i) {
let slice = input.slice(index..);
match $submac2!(slice, $($args2)*) {
Ok((_i, _o)) => {
// NOTE: `return` exits the enclosing function, not just this block
return Ok(input.take_split(index));
},
Err(_e1) => { },
}
}
need_more_err(input, Needed::Size(0), ErrorKind::TakeUntil::<u32>)
}
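For readers who want to try the idea without nom or macros, the same loop can be sketched as a plain function. Everything here is an assumption made for illustration: parsers are modelled as functions returning Option<(remaining, matched)> instead of nom's IResult, john_or_amanda is a hypothetical stand-in for alt!(tag!("John") | tag!("Amanda")), and a failed search returns None where the macro reports Incomplete.

```rust
/// Hypothetical stand-in for `alt!(tag!("John") | tag!("Amanda"))`.
/// On success returns `(remaining, matched)`.
fn john_or_amanda(input: &str) -> Option<(&str, &str)> {
    for name in ["John", "Amanda"] {
        if let Some(rest) = input.strip_prefix(name) {
            return Some((rest, &input[..name.len()]));
        }
    }
    None
}

/// Tries `parser` at every char boundary of `input` and splits at the
/// first offset where it succeeds. Returns `(remaining, consumed)`.
fn take_till_match<'a, P>(input: &'a str, parser: P) -> Option<(&'a str, &'a str)>
where
    P: Fn(&'a str) -> Option<(&'a str, &'a str)>,
{
    // `char_indices` only yields char-boundary offsets, so slicing here
    // never panics on multi-byte text (unlike a plain `index += 1`).
    for (index, _) in input.char_indices() {
        if parser(&input[index..]).is_some() {
            let (consumed, remaining) = input.split_at(index);
            return Some((remaining, consumed));
        }
    }
    None
}
```

With these definitions, take_till_match("Hello, Amanda", john_or_amanda) gives Some(("Amanda", "Hello, ")), matching the comment at the top of the original macro.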
@cormacrelf Thanks for your suggestion! Made some changes and it looks much better.
Hey, just stumbled upon this issue. I actually have a PR open for a take_until_parser_matches combinator. It seems we could then pass a regex parser as the parameter, just like any other nom parser, which would solve the deficiency @cormacrelf pointed out in https://github.com/Geal/nom/issues/709#issuecomment-475954895 . From my first-pass reading of @cormacrelf's code in https://github.com/Geal/nom/issues/709#issuecomment-475958529 , mine functions in a very similar way, except it's a function instead of a macro, and it looks like @cormacrelf's supports streaming whereas mine does not.
Unfortunately it seems Geal is very busy right now so I have no idea when it'll get eyes on it again.
PR: https://github.com/Geal/nom/pull/469
I'd propose closing this as regex functions are no longer present in this crate. I've opened up a new issue on nom-regex, https://github.com/rust-bakery/nom-regex/issues/3, to continue the request.
I don't think take_until_parser_matches is a good solution here, as iterating a regex-containing parser multiple times essentially redoes the work of a Regex "find" function, and thus eliminates a big performance benefit of using regex for this.
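A rough way to see the cost being described is to count how often the inner parser gets invoked. The sketch below is an illustration only: scan_with_parser is a hypothetical name, the parser is reduced to a prefix check, and the standard library's str::find stands in for a regex engine's find.

```rust
/// Hypothetical helper: tries `parser` at each char boundary of `input`.
/// Returns the offset of the first success (if any) and the number of
/// times the parser was invoked along the way.
fn scan_with_parser(input: &str, parser: impl Fn(&str) -> bool) -> (Option<usize>, usize) {
    let mut calls = 0;
    for (index, _) in input.char_indices() {
        calls += 1;
        if parser(&input[index..]) {
            return (Some(index), calls);
        }
    }
    (None, calls)
}
```

On "It's not the end of the world!" with |s| s.starts_with("world") as the parser, this reports offset 24 after 25 parser invocations, whereas input.find("world") reaches the same offset with a single search call.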