ripgrep support more sophisticated boolean matching operations

Open mqudsi opened this issue 6 years ago • 57 comments

With the new convention to use the capitalized version of a short flag to indicate the opposite it's too bad that -E is already used to mean --encoding, as I would like to suggest an "inverse pattern" mode where only lines/words (depending on other parameters as normal) matching pattern e but not matching pattern E are included in the result set.

Andrew, I know you are loath to add more ! support, but given the pre-existing -E, perhaps a -e !PATTERN?

Apr 03 '18 00:04 mqudsi

The name of the flag is really not the interesting part of this feature request. The interesting part is the request to support more sophisticated boolean tests.

I think if we were to decide to do this, then it needs to be part of a larger story that encompasses more sophisticated expressions. We also need to address the fact that, today, we can actually express quite a bit, but it requires piping. Namely, piping permits expressing "and". Piping plus the -v flag permits any arbitrary boolean expression you might want. For example, rg foo | rg -v bar says "show lines matching foo but do not contain bar," which is exactly your feature request.

git grep has support for this via --not, --and and --or. I don't know if I'm willing to add this to ripgrep. There must be a point at which we say, "piping is good enough."

An alternative way to implement this feature is in the regex engine itself (since intersection and complement are available as operations on regular languages), but this is extremely non-trivial to do.

I try not to speak in absolutes, but, "I don't want to add anything else that uses ! in a shell" is as close to an absolute that I can get. Let's drop that idea.

Apr 03 '18 00:04 BurntSushi

I understand completely. I currently pipe (to grep, I didn't realize I could pipe to rg itself!) but was wondering from a performance perspective basically about using the regex engine itself to optimize the search with the additional boolean constraints.

Thanks.

Apr 03 '18 00:04 mqudsi

> but was wondering from a performance perspective basically about using the regex engine itself to optimize the search with the additional boolean constraints.

Well, the "best" way is to, as I hinted at, build complement and intersection into the regex engine. But as I said, this is extremely non-trivial to do efficiently. If we were to implement this, then we'd need an algorithm that selects the (attempted) optimal matching path given all of the boolean conditions. e.g., if you said "x and not y and not z," then ripgrep would search for x and only apply the y and z blacklist on matches to filter them out. If you had x or y or z, then ripgrep would, as it does today, combine them into one regex joined by |. If you had not x and not y and not z, then ripgrep would behave as it would today if you ran rg -v x and then use the y and z blacklists to filter out matches. If you had not x or not y or not z, then ripgrep could behave as it does today if you ran rg -v 'x|y|z'. And so on...

It is plausible that this would result in a performance improvement. But you can't just throw that out there as a benefit and expect it to stick. :-) Performance does not exist in a vacuum. Pipelines tend to be constructed in a way that iteratively reduces the search space, which in turn makes performance less and less of an issue. The interesting bits are probably pipelines that start with an inverted match on a rarely occurring pattern, which would not reduce the search space much. Regardless, I personally find this to be a somewhat flimsy motivation for a feature like this unless someone can convince me otherwise. IMO, if we add a feature like this, it should be primarily for the UX.

Apr 03 '18 00:04 BurntSushi
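
The strategy described above for "x and not y and not z" can be sketched in a few lines of Python. This is only an illustration of the idea, not how ripgrep is implemented: the helper name filter_lines and its parameters are hypothetical, and Python's re module stands in for ripgrep's regex engine. Required patterns are checked first, and the blacklist runs only on lines that survive.

```python
import re


def filter_lines(lines, must_match=(), must_not_match=()):
    """Keep lines that match every pattern in must_match and none in must_not_match.

    Hypothetical helper for illustration; not part of ripgrep.
    """
    required = [re.compile(p) for p in must_match]
    # Any blacklist hit rejects the line, so the blacklist can be joined with "|",
    # mirroring how `rg -v 'y|z'` expresses "not y and not z".
    blacklist = None
    if must_not_match:
        blacklist = re.compile("|".join(f"(?:{p})" for p in must_not_match))

    kept = []
    for line in lines:
        # Check the required patterns first; the blacklist only runs on candidates.
        if not all(r.search(line) for r in required):
            continue
        if blacklist is not None and blacklist.search(line):
            continue
        kept.append(line)
    return kept


lines = ["foo only", "foo and bar", "bar only", "foo and baz", "nothing"]
# "foo and not bar and not baz", the same selection as `rg foo | rg -v bar | rg -v baz`.
print(filter_lines(lines, must_match=["foo"], must_not_match=["bar", "baz"]))
```

With the sample input, only "foo only" is kept. A real implementation would also have to choose which pattern to search for first, which is the hard part the comment points at.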

Example of using git grep with AND patterns:

git grep -e pattern1 --and -e pattern2 --and -e pattern3

Apr 11 '18 00:04 kenorb

Example of AND operation using Rust's regex engine:

rg -N '(?P<p1>.*pattern1.*)(?P<p2>.*pattern2.*)(?P<p3>.*pattern3.*)' file.txt

Apr 11 '18 01:04 kenorb

@kenorb That's presumably not the same as what git grep does. git grep -e pattern1 --and -e pattern2 will match pattern2pattern1 but (.*pattern1.*)(.*pattern2.*) will not. The standard way to perform "and" queries in ripgrep is with piping, as I mentioned above in my comment.

Apr 11 '18 01:04 BurntSushi
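
The difference pointed out above can be checked with a small Python sketch. Python's re module is used here as a stand-in for the regex engines discussed in the thread, and matches_all is a hypothetical helper: the concatenated regex only matches when pattern1 appears before pattern2, while an order-independent "and" tests each pattern separately.

```python
import re

# Concatenated form from the thread (named groups dropped for brevity).
ordered = re.compile(r"(.*pattern1.*)(.*pattern2.*)")


def matches_all(line, patterns):
    """Order-independent "and": every pattern must match somewhere in the line."""
    return all(re.search(p, line) for p in patterns)


for line in ["pattern1 then pattern2", "pattern2pattern1"]:
    print(line, bool(ordered.search(line)), matches_all(line, ["pattern1", "pattern2"]))
```

The second sample line, pattern2pattern1, is rejected by the concatenated regex but accepted by the order-independent check, which is the behavior a pipeline such as rg pattern1 | rg pattern2 gives.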

I quite like the simplicity and "natural feel" of using rg foo | rg bar to do the equivalent of git grep -e foo --and -e bar. The only significant difference is the color.

git grep -e foo --and -e bar [screenshot]

rg string | rg query [screenshot]

See, no highlight of the word string in the rg pipe.

Jun 06 '18 12:06 peterbe

@peterbe You should be able to fix that by adding --color always to your first invocation of ripgrep. Not ideal of course.

Jun 06 '18 12:06 BurntSushi

I don't even know if it's possible with pipes, but if you could know that the next pipe is another rg, then --color always could be on by default. One can dream.

Jun 06 '18 12:06 peterbe

Piping loses the file headers.

rg abc

a.txt
4: ...abc...xyz...
7: ...abc...

b.txt
3: ...abc...xyz...

rg abc | rg xyz

4: ...abc...xyz...
3: ...abc...xyz...

Jun 29 '18 07:06 elbaro

That example doesn't look right. It should retain file names not as headers but in each line in standard grep format.

Jun 29 '18 09:06 BurntSushi

Sorry my bad. It looks like this:

rg abc | rg xyz
a.txt: ...abc...xyz...
a.txt: ...abc...xyz...
b.txt: ...abc...xyz...
b.txt: ...abc...xyz...

Still hard to parse when there are many files. I think it's an example where the built-in op can provide better UX than piping.

Another example is piping with -A or -B.

# want to print lines including both "abc" and "xyz", with ±3 lines of context
rg abc -A 3 -B 3 | rg xyz -A 3 -B 3  # not what we want

Jun 29 '18 15:06 elbaro
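
To make the desired behavior concrete, here is a minimal Python sketch of an "and" match with context. It is not a ripgrep feature: the helper and_with_context and its parameters are hypothetical. The point is that context is computed once, around lines that satisfy all patterns, instead of being re-filtered by a second search as happens in a pipeline.

```python
import re


def and_with_context(lines, patterns, before=3, after=3):
    """Return (line_number, text) pairs for lines matching all patterns, plus context.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    compiled = [re.compile(p) for p in patterns]
    keep = set()
    for i, line in enumerate(lines):
        if all(c.search(line) for c in compiled):
            # Context is taken around the final matches only.
            start = max(0, i - before)
            stop = min(len(lines), i + after + 1)
            keep.update(range(start, stop))
    return [(i + 1, lines[i]) for i in sorted(keep)]


text = [f"line {n}" for n in range(1, 11)]
text[4] = "abc and xyz"   # line 5 matches both patterns
text[8] = "abc only"      # line 9 matches only one pattern
for number, line in and_with_context(text, ["abc", "xyz"], before=1, after=1):
    print(number, line)
```

With the sample input this prints lines 4 to 6; line 9, which contains only abc, contributes neither a match nor context.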

That's certainly part of an argument in favor of this, but I will not allow that argument to be used as a hammer. Taken to its logical conclusion, ripgrep should bundle every conceivable transform on its data. At some point, people need to become OK with piping ripgrep's output and dealing with the different format. Different people will have different opinions on where that line is drawn.

Jun 29 '18 15:06 BurntSushi

I have definitely wished for an easy way to preserve headers when piping rg to rg. Maybe a flag for "header passthrough" would be useful on its own.

Sep 21 '18 19:09 BatmanAoD

> I have definitely wished for an easy way to preserve headers when piping rg to rg. Maybe a flag for "header passthrough" would be useful on its own.

That would be nice but won't work in all cases. E.g., consider

rg -C5 foo | rg -v bar

Now the context lines around the matched lines in the first rg call are being matched by the second rg call and your output may end up being a bit of a mess and not what you might expect.

> IMO, if we add a feature like this, it should be primarily for the UX.

Looking at a few now-closed duplicate issues, what most people want is just "a and not b" with all of the headers/context preserved, which might make sense to special-case if that's much simpler than the general case.

Jan 07 '19 11:01 aldanor

Files look like this:

a.txt
4: ...abc...
30: ...xyz...

b.txt
4: ...abc...
..... (no 'xyz' in content)

How can I find files like a.txt, with 'abc' and 'xyz' on different lines?

Feb 23 '19 00:02 amitbha

Use multiline search.

Feb 23 '19 04:02 BurntSushi

Thanks for the reply. I tried rg -U --multiline-dotall -e 'abc.*xyz' and the right files were found, but there was too much output, like:

4: ...abc...
5: xxxxx
6: xxxxx
...
29: xxxxx
30: ...xyz...

rg -U --multiline-dotall -e 'abc.*xyz' | rg abc

No filenames or line numbers.

rg -U --multiline-dotall -l -e 'abc.*xyz' | rg 'abc' -

No result. How do I read paths from a pipe?

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg 'xyz' "$line"; done

Almost done! But filenames are missing. 😔

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do echo "$line"; rg 'xyz' "$line"; echo; done

Done! 😌

Feb 23 '19 08:02 amitbha

Please skim the options in the man page. Use the -n and --with-filename flags.

Feb 23 '19 13:02 BurntSushi

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg --with-filename 'xyz' "$line"; echo; done

Got it! 😌

Feb 24 '19 09:02 amitbha
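
For comparison, the same two-step idea (first select files that contain every pattern somewhere, then print the interesting lines with file names and line numbers) can be sketched in Python. This is an illustration only, not a replacement for the shell loop above: the helper search_files_with_all is hypothetical, and it reads each file fully into memory, so it is not meant for very large files.

```python
import re
import tempfile
from pathlib import Path


def search_files_with_all(root, required, show):
    """Return (file name, line number, line) for lines matching `show`,
    restricted to files whose content matches every pattern in `required`.

    Hypothetical helper for illustration; reads each file fully into memory.
    """
    results = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        # A pattern may match on any line, so test the whole file content.
        if not all(re.search(p, text) for p in required):
            continue
        for number, line in enumerate(text.splitlines(), start=1):
            if re.search(show, line):
                results.append((path.name, number, line))
    return results


# Small demo with two temporary files shaped like the a.txt / b.txt example.
with tempfile.TemporaryDirectory() as root:
    Path(root, "a.txt").write_text("one\nhas abc here\nthree\nhas xyz here\n")
    Path(root, "b.txt").write_text("one\nhas abc here\nthree\n")
    found = search_files_with_all(root, ["abc", "xyz"], show="xyz")
    for name, number, line in found:
        print(f"{name}:{number}:{line}")
```

Only a.txt is reported, because b.txt contains abc but not xyz.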

Friendly note: the utility of this feature is not in question. More comments explaining how useful this is or the kinds of problems it solves that aren't solved well by the status quo aren't necessary. The key thing blocking this feature is the potentially immense complexity that it adds not only to the implementation, but to the UX. It requires serious design work first, and it's still not clear to me that this is a feature I want to add.

It is well known that git grep supports this stuff. If it does what you want, then just use that.

Feb 24 '19 13:02 BurntSushi

Please consider a utility mode, something like rg --compile-expr a -and b -and c, that generates the relevant DFA.

Usage would be something like rg --dfa $(rg --compile-expr a -and -not b). This would confine the complexity to the compile-expr option. The rest of the UX would remain identical.

Also, piping is problematic for huge files, as the data is copied again for every pipe.

Dec 04 '19 14:12 elazarl

Piping is also an issue when using e.g. --heading

Apr 01 '20 03:04 zachriggle

@zachriggle That's already been mentioned.

Apr 01 '20 12:04 BurntSushi

re --and, I'm not sure if this is blasphemy or even correct at all and I'm probably missing edge cases but we could demorgan it...

$ echo -e 'Hello, foo\nBye, baz\nHello, james\nHello, baz\nbaz likes yellow' | \
    rg --pcre2 '^(?!((?!.*baz.*$)|(?!.*ello.*$)))'
Hello, baz
baz likes yellow

This matches any line containing both baz and ello. Perhaps a useful stop-gap for anyone desperate for a work-around?

May 20 '20 11:05 hraban
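
The look-around trick above can be tried outside ripgrep as well. The sketch below uses Python's re module, which supports look-around, on the same five input lines. It compares the doubly negated pattern from the comment with a more direct form that uses one positive lookahead per required term. The direct form is an addition for illustration, not something proposed in the thread; in ripgrep it would presumably also need --pcre2, since look-around is not supported by the default regex engine.

```python
import re

lines = ["Hello, foo", "Bye, baz", "Hello, james", "Hello, baz", "baz likes yellow"]

# The doubly negated form from the comment above.
de_morgan = re.compile(r"^(?!((?!.*baz.*$)|(?!.*ello.*$)))")
# A more direct form: one positive lookahead per required term.
lookahead = re.compile(r"^(?=.*baz)(?=.*ello)")

via_de_morgan = [line for line in lines if de_morgan.search(line)]
via_lookahead = [line for line in lines if lookahead.search(line)]
print(via_de_morgan)
print(via_lookahead)
```

Both patterns select the same two lines, "Hello, baz" and "baz likes yellow".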

@hraban If you just want a simple and query, then I'd probably recommend just doing

$ echo -e 'Hello, foo\nBye, baz\nHello, james\nHello, baz\nbaz likes yellow' | rg baz | rg ello

With the downsides of course being that you lose the nice formatting and highlighting of baz.

May 20 '20 12:05 BurntSushi

I'm going to suggest that maybe this issue and https://github.com/BurntSushi/ripgrep/issues/473 should be two separate issues.

Personally I'm not that interested in using complex boolean or regex patterns with ripgrep. I just want to be able to specify multiple patterns. Perhaps this could just be specified with a new flag like

rg --patterns "level=error" --patterns "requestID"

Maybe that's too simplistic, but I've been using rg nearly since it was started and I've never had any desire for anything besides a simple 'and' match on multiple patterns.

Oct 28 '20 22:10 sparrc

@sparrc Conceptually, you might be right. But in terms of implementation, I don't think there is much of a difference, so I'm treating them the same. Also, ripgrep does have the ability to search multiple patterns (using the same exact flags as grep). It's just that it's an "or" match.

On top of that, the reason why just wanting "and" match is a little weird is because you can do it with pipelines: rg level=error | rg requestID. It's just that the UX isn't quite as good...

Oct 28 '20 22:10 BurntSushi

@BurntSushi it's not just the UX (which is a major, unfixable problem IMHO. UX issues are much more important than "real" bugs, say, 100% slowdown of some cases).

One of the main reasons for me to use ripgrep, and one of its advantages is speed, so I'm picking it when I'm searching large files. Using multiple pipes slows things down in some cases, as it copies the data, adds syscalls, etc.

This is not 100% the same search, and of course I picked a 3GB file with search terms appearing in most lines, but

$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD.\*LOW dr_agg  >/dev/null

real    0m5.772s
user    0m5.017s
sys     0m0.754s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD.\*LOW dr_agg  >/dev/null

real    0m5.749s
user    0m4.987s
sys     0m0.760s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD dr_agg |./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg LOW >/dev/null

real    0m6.330s
user    0m7.147s
sys     0m2.781s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD dr_agg |./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg LOW >/dev/null

real    0m6.168s
user    0m7.245s
sys     0m2.777s

One second hardly matters, but it is not uncommon for me to search 300GB of files.

Oct 29 '20 07:10 elazarl

+1 from me for multiple "AND" searches

Nov 04 '20 20:11 gd4c
