ripgrep
Implement such `--files-from` option
A recurring workflow of mine is to search within an existing list of files.
Currently I'm living by
$(generate list of files) | while read f; do rg pattern "$f"; done
which is both inconvenient and inefficient.
Ack does provide a --files-from option. Implementing it would allow me to type
rg pattern --list-from <(gen list)
to fulfill my needs.
It seems like there are a few ways to do this without building it into ripgrep. Here are a couple:
[andrew@Cheetah 273] echo test > foo
[andrew@Cheetah 273] echo test > bar
[andrew@Cheetah 273] echo test > baz
[andrew@Cheetah 273] cat > file-list <<EOF
> foo
> bar
> baz
> EOF
[andrew@Cheetah 273] xargs rg test < file-list
foo
1:test
bar
1:test
baz
1:test
[andrew@Cheetah 273] rg test $(cat file-list)
bar
1:test
foo
1:test
baz
1:test
I don't think either of these approaches is less efficient than what ripgrep would do if it were built-in. The only caveat here is that if your file list is big enough, you'll need to use xargs, which will split up the argument list correctly.
Could you explain in more detail why these approaches don't work for you?
Sure. None of your proposed alternatives work with filenames containing spaces. Try e.g.
echo test > "foo a"
echo "foo a" > file-list
xargs rg test < file-list
foo: No such file or directory (os error 2)
a: No such file or directory (os error 2)
I guess the standard solution to that is to delimit your file paths with a NUL terminator (e.g., find ./ -print0) and then tell xargs to read them using xargs -0.
If you aren't generating files with find (or some other tool that can be made to emit NUL terminators), then it seems like you should be able to use xargs -d'\n' rg test < file-list?
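To make the two delimiter options concrete, here is a minimal sketch. The scratch directory and file names are made up for illustration, and the fallback to grep is only there so the snippet still runs where rg is not installed. Note that -d is a GNU xargs extension, while -0 is more widely available.

```shell
# Hypothetical scratch directory and file names, for illustration only.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
echo test > "foo a"
echo test > bar

# Use rg if installed; fall back to grep so the sketch still runs without it.
search=$(command -v rg || command -v grep)

# NUL-delimited list: safe for any file name, even one containing a newline.
nul_hits=$(printf '%s\0' "foo a" bar | xargs -0 "$search" -l test | sort)

# Newline-delimited list: one path per line, split with GNU xargs -d.
printf '%s\n' "foo a" bar > file-list
nl_hits=$(xargs -d '\n' "$search" -l test < file-list | sort)

echo "$nul_hits"
```

Both forms hand "foo a" to the search tool as a single argument, which is what the plain xargs invocation failed to do.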
Fair enough. So, for the record, instead of the syntax I proposed
rg pattern --list-from <(gen list)
I can achieve the same results using
xargs -d'\n' -a <(gen list) rg pattern
Not as convenient, but I can very well live with that. Thanks!
Yes, I think I'd prefer that at this point.
Popping up a level, do also note that ripgrep provides the -g/--glob flag, which allows you to apply ad hoc filters on which files/directories are searched. This obviously only works when your rules are simple, but it does cover a lot of the simpler uses of find ./ ... | xargs grep ....
@BurntSushi My use-case is similar to his where I compile a list of files I'm interested in and search only those files instead of letting ripgrep loose on my entire project which would take a lot longer. Like you suggested, I've been using xargs -d '\n' rg PATTERN < FILELIST.
Next, I wanted to search only some specific filetypes (say C++ source files) within FILELIST so I tried to add a -tcsrc (csrc is a type I created which is defined in my ~/.ripgreprc config file) but that doesn't work as ripgrep seems to ignore any glob/type arguments if provided with an explicit list of files to search from. So I ended up doing xargs -d '\n' rg PATTERN < <(rg '\.(cc|cpp)$' FILELIST) to pre-process the FILELIST before running.
This is kinda bad as I've defined the csrc type elsewhere but I'm not able to use it in this context. Is there a better way to go about this? It'd be nice if ripgrep filters the list of files provided using the type/glob argument if one is provided eg. xargs < FILELIST rg -tcsrc PATTERN
@kshenoy ripgrep has, and probably always will, explicitly ignore any filtering for file paths that are explicitly given on the command line. I realize that for your particular niche case, this isn't what you want, but to do otherwise would grossly complicate the already complex filtering logic that ripgrep performs. e.g., running rg foo blah.py and getting nothing back even if there was a match because blah.py is in your .gitignore would be quite an egregious UX fail. You might instead argue that file type filtering is different because it's explicitly provided on the command line, but it's still something that violates what is now a pretty iron clad rule: "If you give a file path to ripgrep, it will search it."
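Since explicitly given paths are never filtered, one way to get the effect of a type filter is to narrow the list itself before it reaches ripgrep. A minimal sketch, assuming a newline-delimited FILELIST and GNU xargs. The file names are hypothetical, the extension regex merely stands in for a type definition such as csrc, and the grep fallback exists only so the snippet runs without rg.

```shell
# Hypothetical scratch directory and file names, for illustration only.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'needle\n' > a.cc
printf 'needle\n' > b.py
printf 'needle\n' > "c d.cpp"
printf '%s\n' a.cc b.py "c d.cpp" > FILELIST

# Use rg if installed; fall back to grep so the sketch still runs without it.
search=$(command -v rg || command -v grep)

# Filter the list by extension first, then search only the surviving paths.
# -r (GNU xargs) skips the search entirely if the filter leaves nothing.
hits=$(grep -E '\.(cc|cpp)$' FILELIST | xargs -d '\n' -r "$search" -l needle | sort)

echo "$hits"
```

b.py contains the pattern but is never searched, because it was dropped from the list before xargs saw it.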
My use-case is similar to his where I compile a list of files I'm interested in and search only those files instead of letting ripgrep loose on my entire project which would take a lot longer.
You might instead consider using a .ignore or .rgignore file to dictate which files should be skipped when searching your project.
"If you give a file path to ripgrep, it will search it."
That's a reasonable rule to follow. I agree that doing anything else would involve prioritizing between different ways to include/exclude files. Thanks for the clarification.
You might instead consider using a .ignore or .rgignore file to dictate which files should be skipped when searching your project.
I did consider doing that. However, we use Perforce at work and it's easier to compile a list to search through using p4 have ... than to compile a list to ignore. I opted to create a wrapper around rg which adds the --files-from option similar to ack. It may be a little over-engineered :) but it seems to work. Any suggestions for improvement are welcome.
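For readers who want something similar without a full wrapper, a small shell function can approximate the idea. This is only a sketch and not the wrapper mentioned above: the name rg_files_from, the RG_BIN variable and the demo files are all hypothetical, and it relies on GNU xargs for -d and -r.

```shell
# rg_files_from LIST [rg options...] PATTERN
# Hypothetical helper, not a ripgrep feature. LIST is a file with one path per
# line, or "-" to read the list from stdin. Requires GNU xargs (-d, -r).
rg_files_from() {
  list=$1
  shift
  if [ "$list" = "-" ]; then
    xargs -d '\n' -r "${RG_BIN:-rg}" "$@"
  else
    xargs -d '\n' -r "${RG_BIN:-rg}" "$@" < "$list"
  fi
}

# Demo in a scratch directory (file names are made up).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
echo test > "foo a"
echo other > bar
printf '%s\n' "foo a" bar > file-list

# Fall back to grep so the demo runs where ripgrep is not installed.
RG_BIN=$(command -v rg || command -v grep)

from_file=$(rg_files_from file-list -l test)
from_stdin=$(rg_files_from - -l test < file-list)
echo "$from_file"
```

Because xargs appends the paths after the given arguments, the pattern and any options go before the list is expanded, just as in a hand-written rg command line.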
it would be nice to be able to pipe a list of files to ripgrep. right now, I searched for a second pattern in files matching a first pattern with
rg "pattern2" --files-without-match $(rg "pattern1" --files-with-matches)
when it would be nice to do the following because I usually think of the first pattern first
rg "pattern1" --files-with-matches | rg "pattern2" --files-without-match
although, this use case is unique since I'm using --files-without-match, which doesn't work with xargs since xargs calls a different ripgrep process for each file, and so ripgrep will end up printing a bunch more files than I intended it to
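The chained form can be approximated with NUL-delimited output between the two stages. A sketch, assuming ripgrep's --null flag for the first stage and a made-up scratch tree. The helper names with_match and without_match are hypothetical, and the grep branch (GNU grep) is only a stand-in so the snippet runs without rg. As far as I can tell, --files-without-match is decided per file, so splitting the list across several invocations should not change which files are reported.

```shell
# Hypothetical scratch tree, for illustration only.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'pattern1\npattern2\n' > both.txt
printf 'pattern1\n' > first-only.txt
printf 'pattern2\n' > second-only.txt

if command -v rg >/dev/null 2>&1; then
  # Stage 1: NUL-terminated paths of files that match.
  with_match()    { rg --files-with-matches --null "$1" .; }
  # Stage 2: of the paths on stdin, print those that do not match.
  without_match() { xargs -0 -r rg --files-without-match "$1"; }
else
  # GNU grep stand-ins so the sketch runs where ripgrep is not installed.
  with_match()    { grep -rlZ "$1" .; }
  without_match() { xargs -0 -r grep -L "$1"; }
fi

result=$(with_match pattern1 | without_match pattern2)
echo "$result"
```

Only first-only.txt contains pattern1 without also containing pattern2, so it is the only path that survives both stages.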
Could you explain in more detail why these approaches don't work for you?
What about windows?
The main issue is that there is no xargs.
And if you try to add all files to command line:
rg "pattern" C:/longpath1/file1 C:/longpath2/file2 ... C:/longpath200/file200
then it exceeds maximum command length and doesn't work.
I want to search all vimhelp files provided in vim runtime path and there are a lot of files (including various plugin documentation).
I'm re-opening this because it seems impossible or difficult to work around this when xargs is not present.
What should the flag name for this be? --files-from has been proposed. Is that the best name?
Also, should files specified via this method be subject to smart filtering or globs? Files specified on the command line are not, so I would think these shouldn't be either. That is, files to be searched via this method should act as if they were given on the command line.
I'm re-opening this because it seems impossible or difficult to work around this when xargs is not present.
That's great news!
What should the flag name for this be? --files-from has been proposed. Is that the best name?
I proposed files-from after Tar:
man tar|rg -s -A8 -- '-T, --files-from'
-T, --files-from=FILE
Get names to extract or create from FILE.
Unless specified otherwise, the FILE must contain a list of names separated by ASCII LF (i.e. one name per line). The names read are handled the same way as command line arguments. They undergo quote removal and
word splitting, and any string that starts with a - is handled as tar command line option.
If this behavior is undesirable, it can be turned off using the --verbatim-files-from option.
The --null option instructs tar that the names in FILE are separated by ASCII NUL character, instead of LF. It is useful if the list is generated by find(1) -print0 predicate.
Files specified on the command line are not, so I would think these shouldn't be either.
Seconded.
file, rsync, and (as mentioned in the OP) ack also use --files-from.
file basically treats the paths as if they were given on the command line. rsync treats them kind of like includes relative to the source directory, but it also applies include/exclude patterns to them. ack treats them like how rg would treat paths given on the command line: glob patterns and type filters don't apply to them. I guess that's a strong precedent.
@okdana Aye. I also think that if the list of files is given explicitly like this, then users can use other mechanisms of filtering very easily before passing the file to ripgrep. For example, you might use git ls-files to get a list of files tracked by git instead of needing to rely on ripgrep's smart filtering.
And also, come to think of it, if we did allow gitignore or other filters to apply to the list of files given, that would probably prevent this feature from being implemented in any reasonable time frame. gitignore matching, for example, is pretty heavily coupled to directory traversal. Applying -g/--glob rules would probably be easy though.
- I don't care much about the name of the flag so long as these work:
find / | rg pattern --file-from
rg --file-from=FILE
- find / | rg pattern --file-from should start processing right away in a streaming fashion, and not wait for stdin to be closed (likewise with rg --file-from=FILE)
@timotheecour I would expect you to have to write find / | rg pattern --files-from -, where the - is an idiom for opening stdin.
Executing the search before stdin is closed is interesting. That will require some re-factoring inside ripgrep, since right now, it stores the complete set of paths to search in memory. (Because it was always in memory via CLI arguments.) I agree that streaming is probably the right option, although that may be an enhancement that comes after the initial feature lands, depending on how difficult that refactoring is.
enhancement that comes after the initial feature lands
totally fine, then let's keep this issue open till then :-)
No, if that happens, then I'll close this issue and open a new one.
Using xargs adds a few seconds to the execution time when the file list contains 20 000 paths: xargs.exe -a "%tmp%\filelist.txt" -d '\n' rg foobar. It's fine, but ripgrep itself is way faster on the same directory tree that filelist.txt was generated from. I am using fzf to select the files to search in for ripgrep.
Another possible motivation is that using xargs leads to non-optimal performance. E.g. inside a checkout of gecko-dev,
$ time git ls-files -z | xargs -0 rg symlinks >/dev/null
_______________________________________________________
Executed in 1,99 secs fish external
usr time 2,64 secs 434,00 micros 2,64 secs
sys time 2,67 secs 88,00 micros 2,67 secs
but if I just add -P4 to xargs's arguments, the performance improves dramatically:
$ time git ls-files -z | xargs -0 -P4 rg symlinks >/dev/null
________________________________________________________
Executed in 811,66 millis fish external
usr time 3,41 secs 0,00 millis 3,41 secs
sys time 3,26 secs 1,77 millis 3,25 secs
That's a more than 2x improvement.
IIUC the -P option is unsafe to use (can lead to broken mixing of outputs), but it should demonstrate that somewhere inside the processing gets parallelized poorly. And it's not like there can be a lot of lock contention during output: the result is only 500 lines for this example.
@BurntSushi, let me share big feedback on using rg (especially xargs rg). I hope this post will be useful for you and for other people using xargs rg. My rg version is 13.0.0.
Several years ago I started writing a script called host-find for searching my home directory. My task was very special: this script should skip all Chrome profiles (a Chrome profile is any directory which has a NativeMessagingHosts subdirectory) and skip all git submodules (a git submodule is any directory which has a .git regular file as opposed to a .git directory). (Of course, this is a very special task, so I'm not asking for adding this to rg.)
So I wrote a C++ program called my-find which prepares the file list I need. And then I wrote a bash script, host-find, which essentially does my-find ... | xargs ... grep --color=always ... | less -R (the actual command line is bigger, of course).
And this host-find did its job well for some years. But yesterday I decided to speed it up, so I decided to replace grep with rg.
First of all I needed to know whether rg can read a file list from stdin (so that I can remove xargs). I found this bug report (i.e. https://github.com/BurntSushi/ripgrep/issues/273), so it seems it cannot. So I decided to use xargs rg.
Then I needed to know what rg options make rg fully compatible with grep (so that I can simply replace grep with rg in my command line). I have read the README, FAQ and rg --help. All these texts (at first sight) say that rg -uuu is similar to grep -r. But I kept reading and doing experiments and found that this is a lie! It turned out that grep always keeps the order of files supplied as arguments intact. But rg can reorder them. (And I need predictable order!) So:
Bug # 1. Docs may be interpreted as saying that rg -uuu is equivalent to grep -r, but this is not true.
It would be great if rg docs will present some rg command, which will be fully compatible with grep.
Moreover:
Bug # 2. I was unable to find a way to keep argument order intact. rg -j 1 seems to work, but this is not documented.
Fortunately, my-find outputs files in sorted order, so I simply added --sort=path to the rg invocation. But here lies another problem: what exactly does "sorted" mean? my-find employs a very particular sorting order: it splits each path into parts and then sorts the resulting word arrays lexicographically. In other words, my-find sorts paths in this order:
a/a
a+
a0
What order does rg --sort=path use? (Remember that I use xargs rg, so rg may be invoked many times!) If it uses a different order, this will mean that files will be sorted using one method inside one rg invocation, but sorted using a different method between rg invocations (inside one xargs invocation).
So I did experiments and fortunately I found that rg --sort=path sorting is compatible with my-find sorting. But this is not documented and I reported this separately. So:
Bug # 3. I reported it here: https://github.com/BurntSushi/ripgrep/issues/2418
When rg output is redirected and you use the -B and -A options (it is redirected to less in my case and I actually use -B and -A), rg inserts -- between found matches. But -- is not inserted between different invocations of rg inside one xargs invocation. But I found a solution: just add --heading. So:
Advice for xargs users. Add --heading.
But what if a particular rg invocation gets exactly one file? Then the heading will not be displayed. So:
Advice for xargs users. Add /dev/null, i.e. xargs ... rg "$REGEX" /dev/null. The heading will always be displayed.
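The effect of the extra /dev/null argument can be seen without xargs at all. A minimal sketch with a made-up file name: it uses the default non-interactive output rather than --heading, and falls back to grep where rg is missing, since both tools only print the file name when given more than one path.

```shell
# Hypothetical scratch directory and file name, for illustration only.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
echo test > only-file

# Ignore any personal ripgrep config so the output format is the default one.
unset RIPGREP_CONFIG_PATH

# Use rg if installed; fall back to grep so the sketch still runs without it.
search=$(command -v rg || command -v grep)

# One path: the matching line is printed without a file name.
single=$("$search" test only-file)

# Two paths: /dev/null never matches, but it forces the file name to appear.
padded=$("$search" test only-file /dev/null)

printf '%s\n%s\n' "$single" "$padded"
```

This matters with xargs because the last batch may well contain a single path, and its output would otherwise look different from every other batch.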
Some of my files have an actual -- in them. So it is impossible to distinguish a real "--" from an rg-generated one (when using -B, -A and --heading), because the rg-generated -- is colorless even when I pass --color=always. So:
Bug # 4. -- is colorless even in color mode
I found a workaround:
General advice. Pass --line-number when using -B (or -A) and --heading
I have read rg changelog and found that rg sometimes does breaking changes. So I added to my script this:
if [ "$(rg --version | head -n 1)" != "ripgrep 13.0.0" ]; then
echo "${0##*/}: rg problems" >&2
exit 1
fi
The actual rg invocation in my host-find script looks similar to this:
my-find ... | { xargs -d '\n' --no-run-if-empty -- rg -uuu --heading --color=always --no-messages -B 10 -A 10 --no-config --sort=path --line-number -- "$REGEX" /dev/null || :; } | less -R
Notes:
- -d '\n' means that xargs should not apply special processing to its input. xargs should simply split lines and do nothing else
- --no-run-if-empty - it seems this option is not needed, because I already passed /dev/null to rg. But I pass --no-run-if-empty just in case
- It seems -uuu is not needed, because I pass files themselves to rg. But, again, I pass -uuu just in case
- I passed --no-messages, because sometimes I have no read permission
- I pass -- before "$REGEX", because regex may begin with -
- If any rg invocation doesn't find matches, rg will return 1. Thus xargs will return non-zero. And my script will fail, because it has set -e. So I added || :
So I eventually overcame all xargs rg quirks. I actually got a speedup. :) The new version of host-find takes 10.48 s on my data. The previous (grep-based) version took 60.53 s on the same data. (rg is single threaded because of --sort=path, so a multi-threaded version will be even faster.)
All these bugs are subjective, so I didn't report most of them as separate reports. But if you ( @BurntSushi ) want, I will do this
Thanks for the feedback but in the future, please just file a new issue instead of attaching to an existing one. The vast majority of your comment is irrelevant to this issue.
Also, most of the problems you ran into are also quirks of grep. The -- issue for example and the heading issue. You could just do --no-heading and get grep output format, for example.
To comment specifically on a couple things...
Bug # 1. Docs may be interpreted as saying that rg -uuu is equivalent to grep -r, but this is not true.
The README says, "Automatic filtering can be disabled with rg -uuu." And that's not a lie. And the docs for the -u flag say:
-u, --unrestricted
Reduce the level of "smart" searching. A single -u won't respect .gitignore
(etc.) files (--no-ignore). Two -u flags will additionally search hidden files
and directories (-./--hidden). Three -u flags will additionally search binary
files (--binary).
'rg -uuu' is roughly equivalent to 'grep -r'.
And in context, that is absolutely correct. It obviously doesn't mean that rg -uuu is precisely equivalent to grep -r in literally every possible way. It means that ripgrep will search the same set of files.
It would be great if rg docs will present some rg command, which will be fully compatible with grep.
It never will because ripgrep never was, is or will be fully compatible with grep. Once again: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#posix4ever
Bug # 2. I was unable to find a way to keep argument order intact. rg -j 1 seems to work, but this is not documented.
It's not documented because it's not guaranteed. No such documentation exists for grep either.
Another motivation is that using it inside vim with git ls-files is really cumbersome and not os dependent.