TMSU icon indicating copy to clipboard operation
TMSU copied to clipboard

tmsu dupes can't distinguish identical files vs the same file

Open ashleyharvey opened this issue 5 years ago • 8 comments

Tagging ./foo.txt as well as ../fooDir/foo.txt creates two database entries that point to the same file.

TMSU correctly identifies them as duplicates (is this by inode, checksum, or some other method?); however, none of the resolution options make sense because they assume two identical but different files, rather than a a different path that happens to point to the same file.

I wonder if the same thing would happen with soft links. I haven't tested that yet.

Proposed resolution: using the inode, compare it to the duplicate, and if they match, merge the tags. and remove one of the db entries. Perhaps ask the user first. I can't imagine why you'd want two entries (with different paths) pointing to the same file, with different tags.

ashleyharvey avatar Mar 19 '19 08:03 ashleyharvey

I don't see this as a problem. TMSU works at the filename abstraction, not the inode level. Indeed many filesystems don't have inodes. Just because a filesystem supports hard links doesn't mean you want to treat the files as the same at the filename abstraction — if you did why would you bother creating the hard link in the first place?

Say you tag files /some/path/a and /some/path/b that happen to be hard links to the same inode. You may later on decide you want these two files to be different, so you delete /some/path/b and then create a new file there. TMSU will then recognise this change when you come to do a status or repair.

I think the TMSU would be more confusing if it treated two hard links as the same file.

On Tue, 19 Mar 2019, 08:35 Ashley Harvey, [email protected] wrote:

Tagging ./foo.txt as well as ../fooDir/foo.txt creates two database entries that point to the same file.

TMSU correctly identifies them as the same file (is this by inode, checksum, or some other method?); however, none of the resolution options make sense because they assume a different file (inode) that happens to be identical, rather than a a different path that happens to point to the same path.

I wonder if the same thing would happen with soft links. I haven't tested that yet.

Proposed resolution: using the inode, compare it to the duplicate, and if they match, merge the tags. and remove one of the db entries. Perhaps ask the user first. I can't imagine why you'd want two entries (with different paths) pointing to the same file, with different tags.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/oniony/TMSU/issues/168, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAXBxVIAkOmXiPz0JGc0d4b3aM9lAO0ks5vYKFjgaJpZM4b7jTA .

oniony avatar Mar 19 '19 08:03 oniony

I agree with you.

The problem that arises would be if you're not using disparate hard links.

For example: /some/path/a and ../path/a (from within /some/path).

These are the same file and dupes identifies them as identical, yet a new entry is created.

I didn't expect TMSU to behave that way. Why I did it is another matter and to be honest I don't remember. I suspect I was trying to work around some other limitation - maybe I had moved files and the repair feature wasn't repairing successfully so I was re-tagging, that's the only case I can think of to do what I did.

Nevertheless here I am with lots of duplicate entries but not duplicate files and I want to merge them. I just figured if it happened to me it quite easily could happen to someone else and rather than just helping myself and creating a one-off SQL query to do it, it might be worth tweaking the source code.

ashleyharvey avatar Mar 19 '19 10:03 ashleyharvey

Hmm, I thought I took the canonical path.b I'll have a look at this.

On Tue, 19 Mar 2019, 10:20 Ashley Harvey, [email protected] wrote:

I agree with you.

The problem that arises would be if you're not using disparate hard links.

For example: /some/path/a and ../path/a (from within /some/path). These are the same file and dupes identifies them as identical, yet a new entry is created.

I didn't expect TMSU to behave that way. Why I did it is another matter and to be honest I don't really remember. I suspect I was trying to work around some other limitation - maybe I had moved files and the repair feature wasn't repairing successfully so I was re-tagging, that's the only case I can think of to do what I did.

Nevertheless here I am with lots of duplicate entries but not duplicate files and I want to merge them. I just figured if it happened to me it quite easily could happen to someone else and rather than just helping myself and creating a one-off SQL query to do it, it might be worth tweaking the source code.

On Mar 19, 2019, at 01:44, Paul Ruane [email protected] wrote:

I don't see this as a problem. TMSU works at the filename abstraction, not the inode level. Indeed many filesystems don't have inodes. Just because a filesystem supports hard links doesn't mean you want to treat the files as the same at the filename abstraction — if you did why would you bother creating the hard link in the first place?

Say you tag files /some/path/a and /some/path/b that happen to be hard links to the same inode. You may later on decide you want these two files to be different, so you delete /some/path/b and then create a new file there. TMSU will then recognise this change when you come to do a status or repair.

I think the TMSU would be more confusing if it treated two hard links as the same file.

On Tue, 19 Mar 2019, 08:35 Ashley Harvey, [email protected] wrote:

Tagging ./foo.txt as well as ../fooDir/foo.txt creates two database entries that point to the same file.

TMSU correctly identifies them as the same file (is this by inode, checksum, or some other method?); however, none of the resolution options make sense because they assume a different file (inode) that happens to be identical, rather than a a different path that happens to point to the same path.

I wonder if the same thing would happen with soft links. I haven't tested that yet.

Proposed resolution: using the inode, compare it to the duplicate, and if they match, merge the tags. and remove one of the db entries. Perhaps ask the user first. I can't imagine why you'd want two entries (with different paths) pointing to the same file, with different tags.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/oniony/TMSU/issues/168, or mute the thread < https://github.com/notifications/unsubscribe-auth/AAAXBxVIAkOmXiPz0JGc0d4b3aM9lAO0ks5vYKFjgaJpZM4b7jTA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oniony/TMSU/issues/168#issuecomment-474288584, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAXB7vfr6y59obxISSP3pUOgDMAyr3aks5vYLnugaJpZM4b7jTA .

oniony avatar Mar 19 '19 10:03 oniony

I also had a related issue, which I tracked down to the lack of canonical paths (not only absolute paths) in tmsu. Here is an explanation in case someone has the same problem or wants to reproduce it.

Symptoms: I had a file (let's call it foo) with a tag (mytag), as shown in tmsu files mytag. However, tmsu untag foo mytag reported it as untagged, and tmsu tags /path/to/foo didn't show any tag.

Explanation: the file is located on an external drive, mounted by Arch Linux on /run/media/ipkiss/partition. I have a symbolic link /mnt/partition pointing to /run/media/ipkiss/partition, since I find this more convenient. It turns out that the tool I used for tagging (xnviewmp) uses the canonical (/run/media/...) path,, so this is what is passes to tmsu. However, when running tmsu from the command-line I was in /mnt/partition, so all the file paths were considered relative to that, except when when the full canonical path is passed explicitly.

Root cause: I believe that the lack of canonical path (and not only absolute path) is the root cause here. If there was a unique, canonical representation of a file path, it could be used as a key, making it trivial to detect that a file is already present in the database. Such a canonical path would have at least to resolve symbolic links to avoid the problem I describe above.

Unfortunately, Go does not have a way to get a canonical path (at least it didn't in 2016, but I haven't found it in the current documentation either). I don't know if 3P libraries can provide this functionality.

Also note that Rust (considered in #74) has canonical paths.

Workaround: a query in the tmsu DB showed that about two thirds of my paths were canonical, and thus prone to this issue. To fix it I used the following steps:

  • Use tmsu-fs-mv to rename the folder (using relative paths, not canonical ones). After that, all the relative paths are adjusted in the DB, but canonical paths are broken since tmsu did not "understand" them.
  • Use tmsu repair new_name. This fixes all the broken canonical paths, using relative paths.
  • Use tmsu-fs-mv again to restore the original folder name. At this point all the canonical paths are gone, except those which were already broken before.

ipkiss42 avatar Jun 07 '20 09:06 ipkiss42

Inodes are a filesystem specific feature. TMSU works at the path level so that it can work across filesystems. It would resolve a path into its absolute form so that your example, where the path includes "..", should resolve to the same path. If it isn't then that's a bug.

The duplicate file detection is only there because TMSU is using file fingerprints to help repair its database when files are moved, and so that can be leveraged to detect duplicate files. I actually think, now, in retrospect, that it was a mistake adding that functionality as it's not core to what TMSU is doing. I don't think I'll be adding it in the Rust rewrite.

On Tue, 19 Mar 2019, 08:35 Ashley Harvey, [email protected] wrote:

Tagging ./foo.txt as well as ../fooDir/foo.txt creates two database entries that point to the same file.

TMSU correctly identifies them as the same file (is this by inode, checksum, or some other method?); however, none of the resolution options make sense because they assume a different file (inode) that happens to be identical, rather than a a different path that happens to point to the same path.

I wonder if the same thing would happen with soft links. I haven't tested that yet.

Proposed resolution: using the inode, compare it to the duplicate, and if they match, merge the tags. and remove one of the db entries. Perhaps ask the user first. I can't imagine why you'd want two entries (with different paths) pointing to the same file, with different tags.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/oniony/TMSU/issues/168, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAXBxVIAkOmXiPz0JGc0d4b3aM9lAO0ks5vYKFjgaJpZM4b7jTA .

oniony avatar Jun 07 '20 10:06 oniony

Inodes are a filesystem specific feature. TMSU works at the path level so that it can work across filesystems. It would resolve a path into its absolute form so that your example, where the path includes "..", should resolve to the same path. If it isn't then that's a bug.

I agree with what you said, , but it sounds like you replied to the original comment, not to mine. I didn't mention inodes and I didn't use ".." in any path.

My point is that you can still have two (or more) absolute paths for the same file, via symbolic links. Without a canonical representation, this can cause issues like the one I described. I see that as a bug.

The duplicate file detection is only there because TMSU is using file fingerprints to help repair its database when files are moved, and so that can be leveraged to detect duplicate files. I actually think, now, in retrospect, that it was a mistake adding that functionality as it's not core to what TMSU is doing. I don't think I'll be adding it in the Rust rewrite.

No objection here. There are other tools specialized in that, such as rmlint.

ipkiss42 avatar Jun 07 '20 11:06 ipkiss42

Note to self: investigate whether there are cases where canonical path is not used when tagging files.

oniony avatar Sep 21 '20 12:09 oniony

Note to self: investigate whether there are cases where canonical path is not used when tagging files.

Just to be clear, I think that TMSU currently never uses canonical paths: it uses absolute paths instead.

The difference is that a canonical path is always absolute, but an absolute path is not necessarily canonical, if one of the path components is a symlink.

One tricky aspect is that the DB root should probably always be made canonical, but for paths it depends on the --no-dereference option. In particular, when --no-dereference is used, the intent is probably to dereference links outside the root but not inside. For example, if the file is /path/to/root/dir/file, a symlink for to should probably be followed, but not one for dir.

ipkiss42 avatar Sep 21 '20 13:09 ipkiss42