ripgrep
support more sophisticated boolean matching operations
With the new convention of using the capitalized version of a short flag to indicate the opposite, it's too bad that -E is already used to mean --encoding, as I would like to suggest an "inverse pattern" mode where only lines/words (depending on other parameters as normal) matching pattern e but not matching pattern E are included in the result set.
Andrew, I know you are loath to add more ! support, but given the pre-existing -E, perhaps a -e !PATTERN?
The name of the flag is really not the interesting part of this feature request. The interesting part is the request to support more sophisticated boolean tests.
I think if we were to decide to do this, then it needs to be part of a larger story that encompasses more sophisticated expressions. We also need to address the fact that, today, we can actually express quite a bit, but it requires piping. Namely, piping permits expressing "and". Piping plus the -v flag permits any arbitrary boolean expression you might want. For example, rg foo | rg -v bar says "show lines that match foo but do not contain bar," which is exactly your feature request.
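To make the piping point concrete, here is a minimal Python sketch that models each rg invocation as a line filter and chains them the way a shell pipeline would. The `grep` helper and the sample lines are made up for illustration; this is not how ripgrep is implemented.

```python
import re
from typing import Iterable, Iterator


def grep(pattern: str, lines: Iterable[str], invert: bool = False) -> Iterator[str]:
    """Yield lines matching `pattern`, or non-matching lines when invert=True (like -v)."""
    rx = re.compile(pattern)
    for line in lines:
        if bool(rx.search(line)) != invert:
            yield line


# Hypothetical sample input.
lines = ["foo only", "foo and bar", "bar only", "neither"]

# Roughly: rg foo | rg -v bar   ("foo and not bar")
foo_not_bar = list(grep("bar", grep("foo", lines), invert=True))

# Roughly: rg foo | rg bar      ("foo and bar")
foo_and_bar = list(grep("bar", grep("foo", lines)))

print(foo_not_bar)
print(foo_and_bar)
```

Each stage only sees what the previous stage let through, which is why chaining gives "and" and chaining with the inverted filter gives "and not".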
git grep has support for this via --not, --and and --or. I don't know if I'm willing to add this to ripgrep. There must be a point at which we say, "piping is good enough."
An alternative way to implement this feature is in the regex engine itself (since intersection and complement are available as operations on regular languages), but this is extremely non-trivial to do.
I try not to speak in absolutes, but "I don't want to add anything else that uses ! in a shell" is as close to an absolute as I can get. Let's drop that idea.
I understand completely. I currently pipe (to grep; I didn't realize I could pipe to rg itself!) but was wondering, from a performance perspective, about using the regex engine itself to optimize the search with the additional boolean constraints.
Thanks.
but was wondering from a performance perspective basically about using the regex engine itself to optimize the search with the additional boolean constraints.
Well, the "best" way is to, as I hinted at, build complement and intersection into the regex engine. But as I said, this is extremely non-trivial to do efficiently. If we were to implement this, then we'd need an algorithm that selects the (attempted) optimal matching path given all of the boolean conditions. E.g., if you said "x and not y and not z," then ripgrep would search for x and only apply the y and z blacklist on matches to filter them out. If you had x or y or z, then ripgrep would, as it does today, combine them into one regex joined by |. If you had not x and not y and not z, then ripgrep could behave as it would today if you ran rg -v x and then use the y and z blacklists to filter out matches. If you had not x or not y or not z, then by De Morgan's law that is not (x and y and z), so ripgrep would only need to drop lines that match all three patterns. And so on...
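As a rough illustration of the four cases above, the following Python sketch evaluates each boolean form over a list of lines. The function names are hypothetical and nothing here reflects ripgrep internals; it only shows which lines each form selects. The last helper uses the De Morgan reading, not (x and y and z).

```python
import re
from typing import List


def _alternation(patterns: List[str]) -> "re.Pattern[str]":
    """Join patterns with |, wrapping each in a non-capturing group."""
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def search_and_not(lines: List[str], require: str, forbid: List[str]) -> List[str]:
    """'x and not y and not z': find x, then drop hits matching the blacklist."""
    want = re.compile(require)
    block = _alternation(forbid)
    return [l for l in lines if want.search(l) and not block.search(l)]


def search_any(lines: List[str], patterns: List[str]) -> List[str]:
    """'x or y or z': a single alternation."""
    rx = _alternation(patterns)
    return [l for l in lines if rx.search(l)]


def search_none(lines: List[str], patterns: List[str]) -> List[str]:
    """'not x and not y and not z': same selection as rg -v 'x|y|z'."""
    rx = _alternation(patterns)
    return [l for l in lines if not rx.search(l)]


def search_not_all(lines: List[str], patterns: List[str]) -> List[str]:
    """'not x or not y or not z', i.e. not (x and y and z)."""
    rxs = [re.compile(p) for p in patterns]
    return [l for l in lines if not all(rx.search(l) for rx in rxs)]


# Hypothetical sample input.
lines = ["x", "x y", "y z", "x y z", "w"]

and_not = search_and_not(lines, "x", ["y", "z"])
any_of = search_any(lines, ["x", "y", "z"])
none_of = search_none(lines, ["x", "y", "z"])
not_all = search_not_all(lines, ["x", "y", "z"])
```

The first three helpers each make one pass with one or two compiled regexes; the last one needs a per-pattern check on every line, which hints at why a general planner is harder than the special cases.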
It is plausible that this would result in a performance improvement. But you can't just throw that out there as a benefit and expect it to stick. :-) Performance does not exist in a vacuum. Pipelines tend to be constructed in a way that iteratively reduces the search space, which in turn makes performance less and less of an issue. The interesting bits are probably pipelines that start with an inverted match on a rarely occurring pattern, which would not reduce the search space much. Regardless, I personally find this to be a somewhat flimsy motivation for a feature like this unless someone can convince me otherwise. IMO, if we add a feature like this, it should be primarily for the UX.
Example of using git grep with AND patterns:
git grep -e pattern1 --and -e pattern2 --and -e pattern3
Example of AND operation using Rust's regex engine:
rg -N '(?P<p1>.*pattern1.*)(?P<p2>.*pattern2.*)(?P<p3>.*pattern3.*)' file.txt
@kenorb That's presumably not the same as what git grep does. git grep -e pattern1 --and -e pattern2 will match pattern2pattern1 but (.*pattern1.*)(.*pattern2.*) will not. The standard way to perform "and" queries in ripgrep is with piping, as I mentioned above in my comment.
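The ordering difference is easy to check. This small Python sketch uses the standard `re` module as a stand-in for the regex engine (an assumption; it is not ripgrep's engine) and compares an order-independent "and" with the concatenated pattern.

```python
import re
from typing import List


def matches_all(line: str, patterns: List[str]) -> bool:
    """Order-independent 'and': every pattern must match somewhere in the line."""
    return all(re.search(p, line) for p in patterns)


# The concatenated form requires pattern1 to appear before pattern2.
concatenated = re.compile(r"(.*pattern1.*)(.*pattern2.*)")

reversed_line = "pattern2pattern1"
forward_line = "pattern1 then pattern2"

and_reversed = matches_all(reversed_line, ["pattern1", "pattern2"])
concat_reversed = concatenated.search(reversed_line) is not None

and_forward = matches_all(forward_line, ["pattern1", "pattern2"])
concat_forward = concatenated.search(forward_line) is not None

print(and_reversed, concat_reversed, and_forward, concat_forward)
```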
I quite like the simplicity and "natural feel" of using rg foo | rg bar to do the equivalent of git grep -e foo --and -e bar. The only significant difference is the color.
git grep -e foo --and -e bar
rg string | rg query
See, no highlight of the word string in the rg pipe.
@peterbe You should be able to fix that by adding --color always to your first invocation of ripgrep. Not ideal of course.
I don't even know if it's possible with pipes, but if you could know that the next pipe is another rg, then --color always could be on by default. One can dream.
Piping loses the file headers.
rg abc
a.txt
4: ...abc...xyz...
7: ...abc...
b.txt
3: ...abc...xyz...
rg abc | rg xyz
4: ...abc...xyz...
3: ...abc...xyz...
That example doesn't look right. It should retain file names not as headers but in each line in standard grep format.
Sorry my bad. It looks like this:
rg abc | rg xyz
a.txt: ...abc...xyz...
a.txt: ...abc...xyz...
b.txt: ...abc...xyz...
b.txt: ...abc...xyz...
Still hard to parse when there are many files. I think it's an example where the built-in op can provide better UX than piping.
Another example is piping with -A or -B.
# want to print lines containing both "abc" and "xyz", with 3 lines of context before and after
rg abc -A 3 -B 3 | rg xyz -A 3 -B 3  # not what we want
That's certainly part of an argument in favor of this, but I will not allow that argument to be used as a hammer. Taken to its logical conclusion, ripgrep should bundle every conceivable transform on its data. At some point, people need to become OK with piping ripgrep's output and dealing with the different format. Different people will have different opinions on where that line is drawn.
I have definitely wished for an easy way to preserve headers when piping rg to rg. Maybe a flag for "header passthrough" would be useful on its own.
I have definitely wished for an easy way to preserve headers when piping rg to rg. Maybe a flag for "header passthrough" would be useful on its own.
That would be nice but won't work in all cases. E.g., consider
rg -C5 foo | rg -v bar
Now the context lines around the matched lines in the first rg call are being matched by the second rg call and your output may end up being a bit of a mess and not what you might expect.
IMO, if we add a feature like this, it should be primarily for the UX.
Looking at a few now-closed duplicate issues, what most people want is just "a and not b" with all of the headers/context preserved, which might make sense to special-case if that's much simpler than the general case.
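For what it's worth, the "a and not b with headers preserved" case can be sketched outside of ripgrep in a few lines. The Python below is only an illustration under simplifying assumptions: files are given as an in-memory mapping, the function name is made up, and context lines (-A/-B/-C) are not handled at all.

```python
import re
from typing import Dict, List


def and_not_with_headings(files: Dict[str, List[str]], want: str, forbid: str) -> str:
    """Print 'want and not forbid' hits grouped under a file heading."""
    want_rx, forbid_rx = re.compile(want), re.compile(forbid)
    out: List[str] = []
    for name, lines in files.items():
        hits = [
            (n, line)
            for n, line in enumerate(lines, start=1)
            if want_rx.search(line) and not forbid_rx.search(line)
        ]
        if hits:
            out.append(name)  # heading, printed once per file
            out.extend(f"{n}:{line}" for n, line in hits)
            out.append("")  # blank line between files
    return "\n".join(out).rstrip("\n")


# Hypothetical sample data.
files = {
    "a.txt": ["abc xyz", "abc", "nothing"],
    "b.txt": ["abc xyz"],
}

report = and_not_with_headings(files, "abc", "xyz")
print(report)
```

Because both tests run on the same line before anything is printed, the heading and line numbers survive, which is exactly what is lost when the second filter runs on the first tool's formatted output.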
Files look like this:
a.txt
4: ...abc...
30: ...xyz...
b.txt
4: ...abc...
..... (no 'xyz' in content)
How can I find files like a.txt, which have 'abc' and 'xyz' on different lines?
Use multiline search.
Thanks for the reply.
I tried rg -U --multiline-dotall -e 'abc.*xyz' and the right files were found. But there was too much output, like:
4: ...abc...
5: xxxxx
6: xxxxx
...
29: xxxxx
30: ...xyz...
rg -U --multiline-dotall -e 'abc.*xyz' | rg abc
No filenames or line numbers.
rg -U --multiline-dotall -l -e 'abc.*xyz' | rg 'abc' -
No result. How to read path from pipe?
rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg 'xyz' "$line"; done
Almost done! But filenames are missing. 😔
rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do echo "$line"; rg 'xyz' "$line"; echo; done
Done! 😌
Please skim the options in the man page. Use the -n and --with-filename flags.
rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg --with-filename 'xyz' "$line"; echo; done
Got it!
😌
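For anyone who would rather not chain shell loops, the same idea (first pick the files that contain every pattern somewhere, then print only the interesting lines with file name and line number) can be sketched in Python. This is just an illustration: the file contents are an in-memory mapping and the function name is made up.

```python
import re
from typing import Dict, List


def lines_in_files_with_all(files: Dict[str, str], patterns: List[str]) -> List[str]:
    """For files containing every pattern (on any line), list lines matching any pattern."""
    rxs = [re.compile(p) for p in patterns]
    out: List[str] = []
    for name, text in files.items():
        # File-level "and": each pattern must match somewhere in the file.
        if not all(rx.search(text) for rx in rxs):
            continue
        # Line-level "or": report only the lines that matched something.
        for n, line in enumerate(text.splitlines(), start=1):
            if any(rx.search(line) for rx in rxs):
                out.append(f"{name}:{n}:{line}")
    return out


# Hypothetical sample data mirroring the a.txt / b.txt example.
files = {
    "a.txt": "one\n...abc...\nfiller\n...xyz...\n",
    "b.txt": "...abc...\nno match here\n",
}

report = lines_in_files_with_all(files, ["abc", "xyz"])
print("\n".join(report))
```

Unlike the multiline regex 'abc.*xyz', this does not care which pattern comes first in the file and does not print the lines in between.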
Friendly note: the utility of this feature is not in question. More comments explaining how useful this is or the kinds of problems it solves that aren't solved well by the status quo aren't necessary. The key thing blocking this feature is the potentially immense complexity that it adds not only to the implementation, but to the UX. It requires serious design work first, and it's still not clear to me that this is a feature I want to add.
It is well known that git grep supports this stuff. If it does what you want, then just use that.
Please consider a utility mode such as rg --compile-expr a -and b -and c that generates the relevant DFA.
Usage would be something like rg --dfa $(rg --compile-expr a -and -not b). This would confine the complexity to the compile-expr option; the rest of the UX would remain identical.
Also, piping is problematic for huge files, as the data is copied again for every pipe.
Piping is also an issue when using e.g. --heading
@zachriggle That's already been mentioned.
re --and, I'm not sure if this is blasphemy or even correct at all and I'm probably missing edge cases but we could demorgan it...
$ echo -e 'Hello, foo\nBye, baz\nHello, james\nHello, baz\nbaz likes yellow' | \
rg --pcre2 '^(?!((?!.*baz.*$)|(?!.*ello.*$)))'
Hello, baz
baz likes yellow
for matching any line containing baz and ello. Perhaps a useful stop-gap for anyone desperate for a workaround?
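The lookahead trick can be checked without ripgrep. The sketch below uses Python's `re` module as a stand-in for PCRE2 (an assumption; the two engines are not identical, but both support lookaround) and compares the De Morgan form with a plainer pair of positive lookaheads.

```python
import re

lines = [
    "Hello, foo",
    "Bye, baz",
    "Hello, james",
    "Hello, baz",
    "baz likes yellow",
]

# De Morgan form from the comment above: not (not baz or not ello).
demorgan = re.compile(r"^(?!((?!.*baz.*$)|(?!.*ello.*$)))")

# A more direct "and": both lookaheads must succeed at the start of the line.
lookahead = re.compile(r"^(?=.*baz)(?=.*ello)")

via_demorgan = [l for l in lines if demorgan.search(l)]
via_lookahead = [l for l in lines if lookahead.search(l)]

print(via_demorgan)
print(via_lookahead)
```

Either form needs an engine with lookaround, which in ripgrep means passing --pcre2 as in the command above.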
@hraban If you just want a simple and query, then I'd probably recommend just doing
$ echo -e 'Hello, foo\nBye, baz\nHello, james\nHello, baz\nbaz likes yellow' | rg baz | rg ello
With the downsides of course being that you lose the nice formatting and highlighting of baz.
I'm going to suggest that maybe this issue and https://github.com/BurntSushi/ripgrep/issues/473 should be two separate issues.
Personally I'm not that interested in using complex boolean or regex patterns with ripgrep. I just want to be able to specify multiple patterns. Perhaps this could just be specified with a new flag like
rg --patterns "level=error" --patterns "requestID"
Maybe that's too simplistic, but I've been using rg nearly since it was started and I've never had any desire for anything besides a simple 'and' match on multiple patterns.
@sparrc Conceptually, you might be right. But in terms of implementation, I don't think there is much of a difference, so I'm treating them the same. Also, ripgrep does have the ability to search multiple patterns (using the same exact flags as grep). It's just that it's an "or" match.
On top of that, the reason why just wanting an "and" match is a little weird is because you can do it with pipelines: rg level=error | rg requestID. It's just that the UX isn't quite as good...
@BurntSushi it's not just the UX (which is a major, unfixable problem IMHO. UX issues are much more important than "real" bugs, say, 100% slowdown of some cases).
One of the main reasons for me to use ripgrep, and one of its advantages, is speed, so I pick it when I'm searching large files. Using multiple pipes slows things down in some cases, as it copies the data, adds syscalls, etc.
This is not 100% the same search, and of course I picked a 3GB file with search terms appearing in most lines, but
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD.\*LOW dr_agg >/dev/null
real 0m5.772s
user 0m5.017s
sys 0m0.754s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD.\*LOW dr_agg >/dev/null
real 0m5.749s
user 0m4.987s
sys 0m0.760s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD dr_agg |./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg LOW >/dev/null
real 0m6.330s
user 0m7.147s
sys 0m2.781s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD dr_agg |./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg LOW >/dev/null
real 0m6.168s
user 0m7.245s
sys 0m2.777s
One second hardly matters, but it is not uncommon for me to search 300GB of files.
+1 from me for multiple "AND" searches