regex icon indicating copy to clipboard operation
regex copied to clipboard

Add convenience methods for extracting bits of text

Open orlp opened this issue 3 years ago • 17 comments

A lot of my use cases for regex involve extracting some information from text with a simple pattern, where full parsing would be overkill. In this scenario I generally just have N subsections of the text I'm looking to extract (possibly many times), where N is a small constant. Some examples from the recent advent of code:

let re = Regex::new(r"target area:\s*x=(-?\d+)..(-?\d+), y=(-?\d+)..(-?\d+)")?;
let re = Regex::new(r"Player 1 starting position:\s*(\d)\s*Player 2 starting position:\s*(\d)\s*")?;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;

In such a scenario it is quite the pain to extract the information I want using the current API. It generally involves something like this, for the last example:

use anyhow::Context;
use itertools::Itertools;
let (status, x1, x2, y1, y2, z1, z2) = re
    .captures(&line)
    .context("match failed")?
    .iter()
    .skip(1)
    .map(|c| c.unwrap().as_str())
    .collect_tuple()
    .unwrap();

Note that I already had to include a third-party library itertools just to get collect_tuple, without that we would either have to start repeatedly calling next() or get(i), or invoke an unnecessary allocation, and in both cases an extra check is needed to ensure there weren't too many captures (indicating a programmer error).

I think that adding a convenience method for this incredibly common use-case would be nice. I would propose the following (with sample implementation):

/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures is not equal to N.
fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])> {
    let caps = re.captures(text)?;
    let mut participating = caps.iter().flatten();
    let whole_match = participating.next().unwrap().as_str();
    let captured = [0; N].map(|_| participating.next().unwrap().as_str());
    assert!(participating.next().is_none(), "too many participating capture groups");
    Some((whole_match, captured))
}

And similarly extract_iter returning impl Iterator<Item = (&'t str, [&'t str; N])>.


Then, the cumbersome example above would become the elegant:

let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;

Another example from the README:

let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
for caps in re.captures_iter(TO_SEARCH) {
    // Note that all of the unwraps are actually OK for this regex
    // because the only way for the regex to match is if all of the
    // capture groups match. This is not true in general though!
    println!("year: {}, month: {}, day: {}",
                caps.get(1).unwrap().as_str(),
                caps.get(2).unwrap().as_str(),
                caps.get(3).unwrap().as_str());
}

This could become:

let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
for (_date, [year, month, day]) in re.extract_iter(TO_SEARCH) {
    println!("year: {}, month: {}, day: {}", year, month, day);
}

Caveats

This API does panic, but I believe that this is fine. It will only ever panic due to a direct programmer error (mismatching possibly participating capture groups to the number of outputs), which would always have been wrong, and does not represent an exceptional but otherwise valid scenario. And for anyone for which this is not acceptable, they can always still use the original API.

Additionally, we call .flatten() on the capturing groups to hide any non-participating capturing groups. I do believe this is actually fine as well, allowing uses cases such as this, where we have an alternation with a similar capture group on either side, without wanting to capture the context:

let id_strings_re = Regex::new(r#""id:([^"]*)"|'id:([^']*)'"#)?;
let [id] = id_strings_re.extract(text).unwrap().1;

If this would be a welcome addition I'm more than happy to make a pull request.

orlp avatar Dec 22 '21 16:12 orlp

I do find this somewhat intriguing. Const generics does make this sort of API reasonably appealing.

I haven't given this a ton of thought yet, but one question did pop up. Toward the end, you mention that capturing groups are flattened to remove non-participating capturing groups. You provide an example where both sides of an alternation have the same number of captures. But what happens when both sides of the branch have a different number of capturing groups? I believe in such a case, it seems almost impossible to correctly use this new extract API. Because the size of the array returned is determined at compile time, but because of the flattening, the actual number of capture matches is determined at runtime. So it's impossible to reconcile.

I'm not saying that this means we shouldn't add an API like this, but it seems like a possible footgun to me. And I don't think it's a terribly uncommon use case, although certainly not as common as others.

BurntSushi avatar Dec 22 '21 17:12 BurntSushi

But what happens when both sides of the branch have a different number of capturing groups? I believe in such a case, it seems almost impossible to correctly use this new extract API. Because the size of the array returned is determined at compile time, but because of the flattening, the actual number of capture matches is determined at runtime. So it's impossible to reconcile.

Yes, I would consider this a programmer error. There are three ways we could deal with it:

  1. As it is now in the sample implementation, only panic if the number of matched participating groups differs from N. A bit questionable, because your program might seem to work correctly in testing, but panic and fail for a different input.

  2. During regex compilation, keep track of the number of capture groups in the syntax tree. When a|b occurs in the regex, and the number of capture groups for a and b differs, set a flag. Then, extract will check this flag at the start, and if set, immediately panic. (This does assume this problem can only be introduced due to alternation, correct me if I'm wrong on this.)

  3. Don't allow any 'fallible capture groups' (that is, a capture group inside an alternation, or any other such constructs I've missed) whatsoever when using extract, again by checking and setting a flag during compilation. I think this is unnecessarily strict.

orlp avatar Dec 22 '21 18:12 orlp

I think it can be introduced in other ways, e.g., (?:(\w)(\s))? might have 2 or 0 capturing groups. But either way, your suggestion is I believe workable: such a property is something that can be detected by analyzing the syntax of the regex.

BurntSushi avatar Dec 22 '21 18:12 BurntSushi

And I'm sure that there's valid regexes that can be proven to always have the same number of capturing groups that would be excluded by such a static analysis. This would be the advantage of approach nr. 1, nothing would be disallowed. But I do think approach nr. 2 is the most sensible for most users, preventing accidental footguns. Then you can be sure that if it works in testing, it always works. Again, people can always use the original API if they want full flexibility.

Briefly browsing through the source code I think adding a 'deterministic capture group' check to Hir shouldn't be too difficult by adding a method fn deterministic_capture_group_count(&self) -> Option<usize> which recursively computes the value, using the basic rules such that an Alternation HirKind's children must all have an equal Some(x), else the result is None. This value is then computed once during regex compilation near the end, and stored, and extract can then check

assert_eq!(exec.ro.deterministic_capture_group_count, Some(N), "extract must always have N capture groups")

orlp avatar Dec 22 '21 18:12 orlp

Ideally such properties are computed inductively and not recursively. The Hir type has a bunch of "smart" constructors for exactly this reason, and indeed, they are the only public way to compose regular expression syntactic values, precisely so that a number of properties can be computed inductively. This avoids needing to do another recursive pass on the syntax tree.

In any case, I haven't fully decided whether this is an API I'd want. I need to noodle on it. And ultimately get more eyes on it, since once it's added, that's it. No going back.

BurntSushi avatar Dec 22 '21 18:12 BurntSushi

Alright, fair enough, in that case it would be an attribute on HirInfo. Out of curiosity I went over the cases and came up with this:

For empty(), literal(), class(), anchor(), word_boundary():

info.det_capture_count = Some(0);

For repetition(rep):

info.det_capture_count = match (rep.kind, rep.hir.det_capture_count) {
    // Always zero.
    (_, Some(0)) => Some(0),
    (RepetitionKind::Range(RepetitionRange::Exactly(0)), _) => Some(0),
    (RepetitionKind::Range(RepetitionRange::Bounded(_, 0)) => Some(0),

    // Nondeterministic.
    (RepetitionKind::ZeroOrOne, _) => None,
    (RepetitionKind::ZeroOrMore, _) => None,
    (RepetitionKind::Range(RepetitionRange::AtLeast(0)), _) => None,
    (RepetitionKind::Range(RepetitionRange::Bounded(0, _)) => None
    
    // Deterministic.
    (RepetitionKind::OneOrMore, Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::Exactly(_)), Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::AtLeast(_)), Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::Bounded(_, _)), Some(n)) => Some(n),
};

For group(grp):

info.det_capture_count = match grp.kind {
    GroupKind::CaptureIndex(_) => grp.hir.info.det_capture_count.map(|n| n + 1),
    GroupKind::CaptureName(_) => todo!("unresolved question: disallowing named captures, or treat as unnamed?"),
    GroupKind::NonCapturing(_) => grp.hir.info.det_capture_count,
};

For concat(exprs):

info.det_capture_count = exprs.iter().fold(Some(0), |cnt, e| Some(cnt? + e.info.det_capture_count?));

For alternation(exprs):

info.det_capture_count = exprs.iter().reduce(|a, b| (a? == b?).then(|| a.unwrap())).unwrap_or(Some(0));

It does lead to an unresolved question: what should we do regarding named capture groups when using extract?

orlp avatar Dec 22 '21 19:12 orlp

I think they get treated as unnamed groups. Every capturing group, whether named or not, is addressable by index. And this API merely requires that groups are indexable, which is satisfied by all such groups. Named groups just give a name to a group in addition to an index.

BurntSushi avatar Dec 22 '21 19:12 BurntSushi

Ok, that makes sense, then it's just

GroupKind::CaptureIndex(_) | GroupKind::CaptureName(_) => grp.hir.info.det_capture_count.map(|n| n + 1),

orlp avatar Dec 22 '21 19:12 orlp

In any case, I haven't fully decided whether this is an API I'd want. I need to noodle on it. And ultimately get more eyes on it, since once it's added, that's it. No going back.

How is the noodling coming along? As a summary refresher I believe we agreed on a static analysis method being feasible to allow a deterministic number of capture groups be efficiently determined during Regex compile time. This would allow

/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures is not equal to N, or is non-deterministic.
fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])>

to always immediately panic when called with an incorrect N or if the regex has an non-deterministic number of capture groups. This makes the extract method either always succeed or always fail, and never a footgun depending on the text input, which was your main concern on the initial proposal. The existence of this method would make using regex (in my opinion) vastly more convenient to use for extracting matches from text, e.g. replacing

use anyhow::Context;
use itertools::Itertools;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let (status, x1, x2, y1, y2, z1, z2) = re
    .captures(&line)
    .context("match failed")?
    .iter()
    .skip(1)
    .map(|c| c.unwrap().as_str())
    .collect_tuple()
    .unwrap();

with

use anyhow::Context;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;

orlp avatar Apr 30 '22 07:04 orlp

Is it possible to provide a bigger code example? Looking at your samples with fresh eyes, they look unnecessarily convoluted by the fact that you're using tuples. But if you're using tuples, why not just use capture indexing instead? Because you have a tuple, you must know the number of capturing groups at compile time. For example:

let (x, y, z) = (caps[1], caps[2], caps[3]);

I believe this is equivalent to your code but much simpler.

BurntSushi avatar Apr 30 '22 11:04 BurntSushi

I believe this is equivalent to your code but much simpler.

Well, to be apples to apples with the previous example we have to extract all seven parts:

use anyhow::Context;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let caps = re.captures(&line).context("match failed")?;
let (status, x1, x2, y1, y2, z1, z2) = (caps[1], caps[2], caps[3], caps[4], caps[5], caps[6], caps[7]);

This sort of approach really smells though, for multiple reasons:

  1. We have hardcoded a loop over the captures using manual indexing. This is error prone, and boiler-platey.

  2. This code is not actually equivalent in the sense that we have lost the panic when we mess up and mis-match the number of capture groups to our extraction. For example, say we no longer needed the status but forgot to remove the capture group from the regex, and introduced this bug:

    let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
    let caps = re.captures(&line).context("match failed")?;
    let (x1, x2, y1, y2, z1, z2) = (caps[1], caps[2], caps[3], caps[4], caps[5], caps[6]);
    

    This would just silently 'succeed' because our indexing scheme does not check that the number of capture groups matches our extraction. The extract method would correctly instantly panic, as would the collect_tuple.

  3. The indexing scheme is less flexible because it does not flatten deterministically determined capture groups. E.g. consider an (overly) simplified regex for extracting URLs from HTML:

    let re = Regex::new(r#"<(a)[^>]+href="([^"]+)"|<(img)[^>]+src="([^"]+)""#);
    for (_el, [tag, url]) in re.extract_iter(&html).context("match failed")? {
        println!("{tag} {url}");
    }
    

    You can't do this with the indexing API, because we have four capture groups that alternatingly participate.

  4. The indexing approach also suffers from your initially raised footgun: captures can participate or not depending on the user input. Because we haven't done any static checking for deterministic capture groups the indexing approach might succeed with test inputs but fail in production. With extract and extract_iter we can rest easy that our number of capture groups always matches our intent exactly, regardless of the string being matched.

orlp avatar May 01 '22 06:05 orlp

Those are good points.

Does the ecosystem have a way of introducing unstable APIs into crates? I'm not convinced we can get this right on our first try. It would be better I think to make it available somehow without committing to it to see how it gets used in practice.

BurntSushi avatar May 01 '22 12:05 BurntSushi

@BurntSushi tokio does it via a rustc cfg flag: https://docs.rs/tokio/latest/tokio/index.html#unstable-features The most important reason for this (as opposed to a regular feature) is that you can't silently depend on an unstable feature transitively. This way always requires the final person building the crate to set the regex_unstable rustc flag and be aware that they're opting into semver breaking changes. This seems sensible and can definitely be a way to go.

In this very specific scenario there might be an alternative middle-ground. Note that Regex already has this method:

/// Returns the number of captures.
pub fn captures_len(&self) -> usize

I believe could fairly reasonably stabilize the following method on Regex:

/// Returns the number of participating captures that this regex will return on a successful match.
/// If this number is non-deterministic and would depend on the input this function returns `None`.
pub fn participating_captures_len(&self) -> Option<usize>

This function to me looks well-defined and future-proof (we can always return None if no static analysis for future features is possible/reasonable), so even conservatively I don't think we have to worry about getting this function right.

Then if this function were available in regex, we could write extract and extract_iter as proposed as an extension trait in a crate, say, regex_unstable_extensions, where this crate would forever stay < 1.0 to not run into semver stability issues. Then a user would simply state

use regex::Regex;
use regex_unstable_extensions::Extract;

and have extract and extract_iter methods available in their scope for Regex. If we feel these functions are a success and properly designed we can move them into regex proper.

orlp avatar May 01 '22 15:05 orlp

I like the idea of starting with a participating_captures_len. That's a much easier pill to swallow. And it would need to be implemented anyway for extract to work.

My preference would be to wait until #656 lands to add any new features, but this is somewhat small so I think I'd be willing to accept a PR for this. Although it will require I think paving some new ground in regex-syntax, since today I believe all inductive properties are predicates, where as this one is an integer. I don't think it will be a hard path to pave, but it's worth saying that it will need to be paved, if that makes sense. :)

(I don't have any time table for #656 landing. It is my primary focus, but I've been working on it for years already. I just don't have the time I used to devote to FOSS.)

Also, small nit, but "non-deterministic" doesn't sound right to me in this context. The real key here is that the compile time syntactic analysis is impossible to reconcile with the runtime behavior in all cases. I'm not sure how to highlight that pithily, but perhaps being pithy is the wrong approach for something subtle like this anyway. But that wording can be hashed out later.

BurntSushi avatar May 01 '22 16:05 BurntSushi

Also, speaking for my own maintenance bandwidth, I definitely don't have the capacity to maintain a separate unstable crate. That might be something you want to do if you want to go that path. But I think I could take on a cfg gate like what tokio does. That looks low cost.

BurntSushi avatar May 01 '22 16:05 BurntSushi

@BurntSushi Just to confirm I read you correctly, you would be open to accepting a PR implementing participating_captures_len as a start?

As for the non-deterministic wording, how about:

/// Returns the number of participating captures that this regex will
/// return on a successful match. If this number can not be statically
/// determined from the regex this function returns `None`.
pub fn participating_captures_len(&self) -> Option<usize>

orlp avatar May 01 '22 17:05 orlp

@orlp Yeah! I'll have the think on the name a bit more and I'll review docs more carefully in PR review.

BurntSushi avatar May 01 '22 17:05 BurntSushi

Should participating_captures_len get merged, the new proposed implementation for extract is:

/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures can not be statically determined to be equal to N.
/// This is likely due to an optional capture group, or a capture group mis-match between the left
/// and right hand side of an alternation. If you need an optional capture group, consider making the
/// contents of rather than the group itself optional, e.g. `((?:foo)?)` instead of `(foo)?`.
pub fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])> {
    match self.participating_captures_len() {
        Some(c) if c == N => {
            let caps = self.captures(text)?;
            let mut participating = caps.iter().flatten();
            let whole_match = participating.next().unwrap().as_str();
            let captured = [0; N].map(|_| participating.next().unwrap().as_str());
            Some((whole_match, captured))
        }
        Some(c) => panic!("wrong number of capture groups, expected {} found {}", N, c),
        None => panic!("regex could not statically be determined to have {} capture groups", N),
    }
}

With a self-contained usage example:

let re = Regex::new(r"\s*(\d{4})-(\d{2})-(\d{2})\s*").unwrap();
let [year, month, day] = re.extract(&date).expect("could not parse date").1;

While it and its cousin extract_iter are fairly trivial to implement, I strongly feel that this method would make regex much more convenient out-of-the box for the average user if they are interested in extracting a bit of text from a larger document or simply destructuring text. I think its inclusion in regex would help a lot of people save time, and reduce bugs as the number of matches is now automatically and statically checked.

orlp avatar Sep 24 '22 15:09 orlp

I grant that it looks very appealing, but the footgun here is very concerning to me. I'm not sure how to mitigate it, and if we can't, whether it's still worth adding. I'm just not sold on it, sorry.

BurntSushi avatar Sep 24 '22 17:09 BurntSushi

I grant that it looks very appealing, but the footgun here is very concerning to me. I'm not sure how to mitigate it, and if we can't, whether it's still worth adding. I'm just not sold on it, sorry.

Which footgun are you referring to?

orlp avatar Sep 24 '22 18:09 orlp

Panicking when the number of capturing groups isn't fixed.

BurntSushi avatar Sep 24 '22 18:09 BurntSushi

I think there are other problems too, particularly with its flexibility. i.e., If you want to only get the first N capturing groups, you still have to write out a type that handles all of the capturing groups.

BurntSushi avatar Sep 24 '22 18:09 BurntSushi

Panicking when the number of capturing groups isn't fixed.

I would say this is a feature rather than a footgun, it ensures that you'll never get a panic due to a bad index with some unexpected input. It guarantees that if you test your function once it will handle all possible inputs. I think it will reduce bugs rather than introduce them.

I think there are other problems too, particularly with its flexibility. i.e., If you want to only get the first N capturing groups, you still have to write out a type that handles all of the capturing groups.

I don't think that's all that bad. E.g. if I only wanted the year in the above I have two options:

let [year, _, _] = re.extract(&date).expect("could not parse date").1;
let [year, ..] = re.extract::<3>(&date).expect("could not parse date").1;

The latter scales even for very large number of capture groups. Plus it's not like extract replaces anything, you can still use the old way if it's more convenient.

orlp avatar Sep 24 '22 18:09 orlp

I would say this is rather a feature than a footgun, it ensures that you'll never get a panic due to a bad index with some unexpected input. It guarantees that if you test your function once it will handle all possible inputs. I think it will reduce bugs rather than introduce them.

That's a decent point.

I'd like to revisit this after #656 lands. I think the next step will be to write up the API along with its documentation and doc examples, and put it into the first comment of this issue. At that point, we should request feedback from folks to make sure we aren't missing anything.

BurntSushi avatar Sep 24 '22 18:09 BurntSushi

I also wonder whether it might be better to flatten the tuple into a single array, where the first element of the array is the entire match. That would be more consistent with the rest of the APIs in the crate.

BurntSushi avatar Sep 24 '22 18:09 BurntSushi

I'd like to revisit this after https://github.com/rust-lang/regex/issues/656 lands.

That is fine.

I also wonder whether it might be better to flatten the tuple into a single array, where the first element of the array is the entire match.

My only problem with that is that it makes using [T; N]::map for follow-up parsing more annoying, as there isn't an easy way to split arrays. Say with the date example if you wanted it as numbers you could do

let ymd = re.extract(&date).expect("could not parse date").1;
let [year, month, day] = ymd.map(|x| x.parse::<u32>().unwrap());

if the result was simply [&str; N] including the whole string you would need to either manually specify the number of capture groups to extract, or repeat the year, month, day thrice:

// Specify length manually.
let [_, ymd @ ..] = re.extract<4>(&date).expect("could not parse date");
let [year, month, day] = ymd.map(|x| x.parse::<u32>().unwrap());

// Repeat variables.
let [_, year, month, day] = re.extract(&date).expect("could not parse date");
let [year, month, day] = [year, month, day].map(|x| x.parse::<u32>().unwrap());

I just feel that most of the time when extracting you're only interested in the bits you extract, and not the whole match. And when you do want the whole thing it's there as the first tuple member.

That said this isn't a hill I'm willing to die on, if you'd still prefer a flat single array.

orlp avatar Sep 24 '22 18:09 orlp

I've been giving this API proposal a bit more thought. I've thought of some other ideas/questions/concerns:

  1. It seems like this API would discourage the use of capture group names in favor of relying on capture group positions.
  2. I believe an extract() method could be added to Captures, which would make the API a bit more orthogonal at the expense of additional typing. So for example, you'd need to do re.captures(haystack).map(|c| c.extract()) instead. It would similarly avoid needing to add a new iterator type, e.g., re.captures_iter(haystack).map(|c| c.extract()).

I want to clarify that I am overall positive on adding this API. I'm just trying to think of every angle I can, because even though it looks small, I consider this to be a major addition.

BurntSushi avatar May 03 '23 18:05 BurntSushi

It seems like this API would discourage the use of capture group names in favor of relying on capture group positions.

I suppose. You could design an extract API that uses capture group names as well, but that would involve procedural macros.

I believe an extract() method could be added to Captures, which would make the API a bit more orthogonal at the expense of additional typing.

I really don't understand what that would do. Would .extract() here just be shorthand for .unwrap().as_str()?

If so, we would lose all the benefits of checked length we discussed above, and it would be significantly less convenient as you can not pattern match like you can with an array. Perhaps you could elaborate a bit more on how .extract() on Captures would work?

EDIT: I believe I misunderstood the .map to be the iterator map rather than a Result map. Would I be correct in understanding that you would propose instead of this:

let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;

it would be this?

let [status, x1, x2, y1, y2, z1, z2] = re.captures(&line).context("match failed")?.extract().1;

If so, I would be fine with that. There's pros and cons for both. My main concern is that the API is less discoverable if a user has to click on Captures in the documentation to find out there's an extract method at all.

orlp avatar May 04 '23 11:05 orlp

Yes, your edit is correct. It's the same API, just on Captures.

I am tempted to start there, because even if we added Regex::{extract, extract_iter}, it would probably make sense to also add it to Captures as well. Starting there is also a little more conservative. If the API turns out to be a bad idea, the fact that it's a little less discoverable will be an advantage. But if it turns out to be a hit, then we can add Regex::{extract, extract_iter} later.

I'm going to plan on getting this added to the regex 1.9 release. And I'll try and mitigate the discoverability problem, at least for now, by using it in top-level examples in the crate docs. (Which I am currently rewriting.)

BurntSushi avatar May 04 '23 11:05 BurntSushi

Here is the API item for Captures::extract. What do you think?

    /// This is a convenience routine for extracting the substrings
    /// corresponding to matching capture groups.
    ///
    /// This returns a tuple where the first element corresponds to the full
    /// substring of the haystack that matched the regex. The second element is
    /// an array of substrings, with each corresponding to the to the substring
    /// that matched for a particular capture group.
    ///
    /// # Panics
    ///
    /// This panics if the number of possible matching groups in this
    /// `Captures` value is not fixed to `N` in all circumstances.
    /// More precisely, this routine only works when `N` is equivalent to
    /// [`Regex::static_captures_len`].
    ///
    /// Stated more plainly, if the number of matching capture groups in a
    /// regex can vary from match to match, then this function always panic.
    ///
    /// For example, `(a)(b)|(c)` could produce two matching capture groups
    /// or one matching capture group for any given match. Therefore, one
    /// cannot use `extract` with such a pattern.
    ///
    /// But a pattern like `(a)(b)|(c)(d)` can be used with `extract` because
    /// the number of capture groups in every match is always equivalent,
    /// even if the capture _indices_ in each match are not.
    ///
    /// # Example
    ///
    /// ```
    /// use regex::Regex;
    ///
    /// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap();
    /// let hay = "On 2010-03-14, I became a Tenneessee lamb.";
    /// let Some((full, [year, month, day])) =
    ///     re.captures(hay).map(|caps| caps.extract()) else { return };
    /// assert_eq!("2010-03-14", full);
    /// assert_eq!("2010", year);
    /// assert_eq!("03", month);
    /// assert_eq!("14", day);
    /// ```
    ///
    /// # Example: iteration
    ///
    /// This example shows how to use this method when iterating over all
    /// `Captures` matches in a haystack.
    ///
    /// ```
    /// use regex::Regex;
    ///
    /// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap();
    /// let hay = "1973-01-05, 1975-08-25 and 1980-10-18";
    ///
    /// let mut dates: Vec<(&str, &str, &str)> = vec![];
    /// for (_, [y, m, d]) in re.captures_iter(hay).map(|c| c.extract()) {
    ///     dates.push((y, m, d));
    /// }
    /// assert_eq!(dates, vec![
    ///     ("1973", "01", "05"),
    ///     ("1975", "08", "25"),
    ///     ("1980", "10", "18"),
    /// ]);
    /// ```
    ///
    /// # Example: parsing different formats
    ///
    /// This API is particularly useful when you need to extract a particular
    /// value that might occur in a different format. Consider, for example,
    /// an identifier that might be in double quotes or single quotes:
    ///
    /// ```
    /// use regex::Regex;
    ///
    /// let re = Regex::new(r#"id:(?:"([^"]+)"|'([^']+)')"#).unwrap();
    /// let hay = r#"The first is id:"foo" and the second is id:'bar'."#;
    /// let mut ids = vec![];
    /// for (_, [id]) in re.captures_iter(hay).map(|c| c.extract()) {
    ///     ids.push(id);
    /// }
    /// assert_eq!(ids, vec!["foo", "bar"]);
    /// ```
    pub fn extract<const N: usize>(&self) -> (&'t str, [&'t str; N]) { ... }

BurntSushi avatar May 05 '23 01:05 BurntSushi