Feature Suggestion: ignore hard linked files by default

Open halfmanhalffish opened this issue 3 years ago • 17 comments

Since hard linked files are identical, search results from them will always be duplicates. It'd be a nice power tool feature to ignore all but one of them by default. Eg:

ls file1
ln file1 subdir/file1
stat file1
[snip]
Device: 802h/2050d Inode: 6325564 Links: 2
ack searchterm
./file1: ...searchterm...
[but no search results from subdir/file1]

halfmanhalffish avatar Aug 12 '20 14:08 halfmanhalffish

ack already has get_file_id, which it uses to skip over duplicates specified on the command line, so this probably wouldn't be too bad to add!
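To illustrate the idea, here is a minimal Python sketch of skipping duplicates by file identity. It is not ack's Perl code: `file_id` and `unique_files` are hypothetical names, and the sketch assumes a POSIX filesystem where the (device, inode) pair identifies a file.

```python
import os
import tempfile


def file_id(path):
    """Return a (device, inode) pair identifying the underlying file."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def unique_files(paths):
    """Yield each path whose underlying file has not been seen yet."""
    seen = set()
    for path in paths:
        fid = file_id(path)
        if fid not in seen:
            seen.add(fid)
            yield path


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "file1")
    os.mkdir(os.path.join(tmp, "subdir"))
    linked = os.path.join(tmp, "subdir", "file1")
    with open(original, "w") as fh:
        fh.write("searchterm\n")
    os.link(original, linked)  # hard link, like `ln file1 subdir/file1`

    kept = list(unique_files([original, linked]))
    demo_kept = [os.path.relpath(p, tmp) for p in kept]
    print(demo_kept)
```

Only the first path given for each underlying file survives, so the result depends on the order of the input list.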

hoelzro avatar Aug 12 '20 16:08 hoelzro

Since we already have the tooling to skip duplicates (whether by fs+inode or hash), having an option to omit duplicates seems very reasonable.

OTOH I frequently do want to know that the duplicate files also match, especially with -l (which files, not which lines), so this cannot be the default behavior, at least with -l.

On forward compatibility grounds, I suggest this not be the default behavior (in ack3.$n++) but users who wish it to be default behavior can affirmatively set the needed option in .ackrc. (a hypothetical ack4 can change defaults incompatibly, of course).

How does this handle soft-link aliases? The final resolved dev:inode will still be the same, so would those be weeded out too?
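A small Python sketch of the symlink question, assuming a POSIX filesystem (this says nothing about how ack itself stats files): `os.stat` follows the symlink and reports the target's dev:inode, while `os.lstat` reports the symlink's own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "foo")
    alias = os.path.join(tmp, "foo_symlink")
    with open(target, "w") as fh:
        fh.write("blah\n")
    os.symlink(target, alias)

    real = os.stat(target)
    followed = os.stat(alias)     # follows the symlink to its target
    unfollowed = os.lstat(alias)  # describes the symlink itself

    real_id = (real.st_dev, real.st_ino)
    same_when_followed = (followed.st_dev, followed.st_ino) == real_id
    same_when_not_followed = (unfollowed.st_dev, unfollowed.st_ino) == real_id
    print(same_when_followed, same_when_not_followed)
```

So whether symlink aliases get weeded out would depend on whether the identity check follows links or not.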

n1vux avatar Aug 12 '20 18:08 n1vux

Before we get into the "how it could be done", let's talk interface.

In an example where you have, say:

ln foo foo1
ln foo subdir/foo2
ln foo other/dir/foo3
ln foo foo4

Which file gets recognized and which ones get ignored?

Will it just be whichever one gets directory traversed first? That means that sometimes you might get a hit in foo and the next time it's in foo4 because we don't know what order the entries are coming back in.

petdance avatar Aug 12 '20 19:08 petdance

There are several reasons this should not be "by default"

  • Memory usage (potentially big uptick on large filesystem) (h/t @hoelzro )
  • backwards compatibility
  • grep compatibility
  • hiding information by default
  • users such as OP who want this default have an easy way to do that, in .ackrc

n1vux avatar Aug 12 '20 19:08 n1vux

Which file gets recognized and which ones get ignored? Will it just be whichever one gets directory traversed first?

With any known Unix/Linux filesystem, there is no memory of which hardlink was "first", so yes.

If the order in which ack reports files is non-deterministic, then which link is suppressed and which is included will also be non-deterministic under this option.

We already have an option --sort-files to pay extra to get determinism, so that is an already solved problem?
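As a rough Python illustration of that point (a sketch, not how ack's --sort-files is implemented; `first_seen` is a hypothetical helper): with a first-seen-wins rule, the surviving link depends on input order, and sorting the paths first removes that dependence.

```python
import os
import tempfile


def first_seen(paths):
    """Keep the first path seen for each (device, inode) pair."""
    seen = set()
    kept = []
    for path in paths:
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            kept.append(os.path.basename(path))
    return kept


with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, n) for n in ("foo", "foo1", "foo2", "foo3")]
    with open(paths[0], "w") as fh:
        fh.write("blah\n")
    for p in paths[1:]:
        os.link(paths[0], p)  # foo1, foo2, foo3 are hard links to foo

    # Two possible traversal orders of the same directory.
    order_a = [paths[0], paths[2], paths[1], paths[3]]
    order_b = [paths[2], paths[0], paths[3], paths[1]]

    unsorted_a = first_seen(order_a)
    unsorted_b = first_seen(order_b)
    sorted_a = first_seen(sorted(order_a))
    sorted_b = first_seen(sorted(order_b))
    print(unsorted_a, unsorted_b, sorted_a, sorted_b)
```

The unsorted runs disagree about the survivor, while both sorted runs keep the same one.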

n1vux avatar Aug 12 '20 19:08 n1vux

Before we get into the "how it could be done", let's talk interface.

I suspect the point of @hoelzro's "implementation" comment is that we're already paying the filesystem performance penalty needed for this, so it's "low hanging fruit" as opposed to "expensive wishlist, but don't hold your breath".

(Rob is correct that on a large filesystem there could be significant memory cost to keeping an additional hash, even if it's just {"$dev:$inode"=>1 ,...} but if it's a cost only paid on an affirmative option that doesn't bother me.)
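One way the extra hash could be kept small, sketched in Python under stated assumptions: only regular files are considered, symlinks are skipped, and only files whose link count is greater than 1 are remembered, since a file with a single link cannot show up again through another hard link. `walk_unique` is a hypothetical name, not anything in ack.

```python
import os
import stat
import tempfile


def walk_unique(root):
    """Return (kept paths, number of remembered ids) for regular files under root."""
    seen = set()
    kept = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fixed traversal order, so the survivor is predictable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # assumption: skip symlinks and special files
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue  # another link to a file already reported
                seen.add(key)
            kept.append(os.path.relpath(path, root))
    return kept, len(seen)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "file1"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("searchterm\n")
    os.mkdir(os.path.join(tmp, "subdir"))
    os.link(os.path.join(tmp, "file1"), os.path.join(tmp, "subdir", "file1"))

    kept, remembered = walk_unique(tmp)
    print(kept, remembered)
```

Here three files are reported but only one identity is stored, so the memory cost scales with the number of multiply linked files rather than with the whole tree.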

n1vux avatar Aug 12 '20 19:08 n1vux

With any known Unix/Linux filesystem, there is no memory of which hardlink was "first", so yes.

Just because I don't know of one doesn't mean there is not a single one ... but that's all we can assume (and not even that on Windows!)

Does MacOSX f/s remember which hardlink was original primary? They have extra metadata of strange sorts ...

n1vux avatar Aug 12 '20 20:08 n1vux

So the question, @halfmanhalffish, is: If you can get different results from two different runs of ack without any of the contents of your files changing, is that still a useful feature?

petdance avatar Aug 13 '20 15:08 petdance

Sorry @petdance I don't understand your meaning. 'two different runs' ... do you mean 2 runs with different switches?

Incidentally, I've worked around this hardlink problem by using ag (apologies!!! but ack doesn't seem to have this feature) which by default ignores files listed in .gitignore (which naturally includes the hard linked files I want to disregard).

halfmanhalffish avatar Aug 13 '20 16:08 halfmanhalffish

Sorry @petdance I don't understand your meaning. 'two different runs' ... do you mean 2 runs with different switches?

No. I'm saying that if we ignore hard links it is possible that you could get two different sets of results from the same invocation of ack, without the contents of the files changing.

$ echo blah > foo
$ ln foo foo1
$ ln foo foo2
$ ln foo foo3
$ ack blah
foo:1:blah
$ ack blah
foo2:1:blah

Note how the first invocation of ack reported a match in foo, and the second in foo2, even though we didn't do anything to change the file contents. That's because we don't know what order we'll be getting filenames in.

by using ag (apologies!!! but ack doesn't seem to have this feature)

There's nothing to apologize for. I want you to use the tool that works best for you. I'm pretty sure that ripgrep also can respect .gitignore. Here's a feature comparison chart.

You can also put some --ignore-file options in your project's .ackrc. That would achieve the same thing, although it would mean having the same files ignored in two different ways in two different files (.gitignore and .ackrc).

petdance avatar Aug 13 '20 16:08 petdance

If ack is invoked the same way and the fs isn't changed then one wouldn't expect different output. But you're right, if certain ack switches meant the fs was being searched differently then one wouldn't want matches from different hard linked files showing up.

Someone wrote above that this feature should be opt-in (and I think he was right). So if there's to be config or switches then why not just require the config to specify which hard linked file to check?

halfmanhalffish avatar Aug 13 '20 17:08 halfmanhalffish

If ack is invoked the same way and the fs isn't changed then one wouldn't expect different output.

Right, but who knows how or when the filesystem will change?

why not just require the config to specify which hard linked file to check?

I'm not sure what you mean. If you've got foo, foo2 and foo3 that are all hard links to each other, how would we specify which one of the three is the one that you'd want to have matched, and which two would be ignored?

petdance avatar Aug 13 '20 18:08 petdance

Perhaps ...

ack --ignore-hard-linked-files would show matches only from the file 'nearest' to the working directory, where nearest would be judged either by file+pathname length as a string, or by the number of subdirectories.

Or specify files explicitly ack --ignore-hard-linked-files=dir/file1,dir/dir2/file2

halfmanhalffish avatar Aug 13 '20 18:08 halfmanhalffish

ack --ignore-hard-linked-files would show matches only from the file 'nearest' to the working directory

But what if they're all in the same directory, as in my example above?

ack --ignore-hard-linked-files=dir/file1,dir/dir2/file2

You already have --ignore-file=is:dir/file1.

petdance avatar Aug 13 '20 19:08 petdance

If all in same directory then pick the one with the shortest name. If same name length then lexically in the strcmp() sense, ie 'a' < 'b'. Good point about --ignore-file=is:dir/file1.
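A Python sketch of that selection rule, assuming the paths are relative to the working directory and already known to be links to the same file. The names `preference_key` and `pick_preferred` are made up for illustration, and Python compares strings by code point, which matches strcmp() for ASCII names.

```python
import os


def preference_key(path):
    """Order by directory depth, then name length, then name."""
    norm = os.path.normpath(path)
    depth = norm.count(os.sep)  # number of subdirectories below the cwd
    name = os.path.basename(norm)
    return (depth, len(name), name)


def pick_preferred(paths):
    """Return the one link, out of a group of links to one file, to report."""
    return min(paths, key=preference_key)


nearest = pick_preferred(["subdir/foo2", "foo1", "other/dir/foo3", "foo"])
same_length = pick_preferred(["b", "a"])
dotted = pick_preferred(["./subdir/file1", "./file1"])
print(nearest, same_length, dotted)
```

One consequence of a rule like this is that every link in a group has to be discovered before the preferred one can be chosen, so results could not be reported as files are encountered.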

halfmanhalffish avatar Aug 13 '20 19:08 halfmanhalffish

There's another reason such a feature should not be on by default: surprise.

When I use a hard link, I'm generally expressing "I want this file to be in two places in the file system at once, as if one were a cp of the other and automatically synchronized" but for the linkage to be practically undetectable. If I wanted special "do not traverse" behavior, I would use a symlink. The fact that I used a hard link means that I would expect that almost any operation performed on a group of files will be performed on the file once per link, not just once. ack x returning lines from just a single instance of the file when I've got two hard linked copies in the tree would surprise me, and look like an ack bug.

On interface, I recommend against a name like --ignore-hard-linked-files which sounds like it would ignore all instances of a file with more than one link, rather than deduplicating the search space to a single file instance. Perhaps something like --skip-duplicate-links would be a clearer way to express the feature's behavior, and it could apply both to multiple symlinked instances of the file found via --follow as well as multiple hard links.

flwyd avatar Oct 23 '21 06:10 flwyd

Well spoken, but I don't think you can assume that symlinks are a free and easy alternative to hard links. Many programs, including at least some compilers, won't follow symlinks, so you're forced to use a hard link.

Perhaps there isn't any 100% right answer to this question.

halfmanhalffish avatar Oct 23 '21 10:10 halfmanhalffish