Clones should probably be ignored / size computed differently

Open Babwin opened this issue 4 years ago • 9 comments

Hello,

First, thanks for this amazing tool. Thanks to it I was able to clean a lot of bull*** off my computer.

I'm opening this issue on a subject I don't really understand, but I hope it can help.

Thanks to the BSD/Apple clone feature, I am able to clone directories and files with cp -c.

Clones are, as I understand it, a kind of weird hard link, but when you write over one, it saves the diff.

In my example, you can see that I cloned the dankest movie in my library a few times and df -h doesn't report any difference in disk usage.

dust reports the clones as if they take more space on disk.

[Movies] df -h
Filesystem      Size   Used  Avail Capacity     iused      ifree %iused  Mounted on
/dev/disk1s1   466Gi   10Gi   90Gi    11%      484283 4881968597    0%   /
devfs          403Ki  403Ki    0Bi   100%        1404          0  100%   /dev
/dev/disk1s2   466Gi  352Gi   90Gi    80%     3172969 4879279911    0%   /System/Volumes/Data
/dev/disk1s5   466Gi   12Gi   90Gi    12%          12 4882452868    0%   /private/var/vm
map auto_home    0Bi    0Bi    0Bi   100%           0          0  100%   /System/Volumes/Data/home
/dev/disk2s2   105Mi  105Mi    0Bi   100%           3 4294967276    0%   /Volumes/Install Google Drive File Stream
drivefs         30Gi  7.0Gi   23Gi    24% 18446744069414596880 4294967295 146880675765702656%   /Volumes/GoogleDrive
drivefs         30Gi  7.0Gi   23Gi    24% 18446744069414740697 4294967295 11796403584574832%   /Volumes/GoogleDrive
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone1
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone2
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone3
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone4
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone5
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone6
[Movies] df -h
Filesystem      Size   Used  Avail Capacity     iused      ifree %iused  Mounted on
/dev/disk1s1   466Gi   10Gi   90Gi    11%      484283 4881968597    0%   /
devfs          403Ki  403Ki    0Bi   100%        1404          0  100%   /dev
/dev/disk1s2   466Gi  352Gi   90Gi    80%     3172987 4879279893    0%   /System/Volumes/Data
/dev/disk1s5   466Gi   12Gi   90Gi    12%          12 4882452868    0%   /private/var/vm
map auto_home    0Bi    0Bi    0Bi   100%           0          0  100%   /System/Volumes/Data/home
/dev/disk2s2   105Mi  105Mi    0Bi   100%           3 4294967276    0%   /Volumes/Install Google Drive File Stream
drivefs         30Gi  7.0Gi   23Gi    24% 18446744069414596880 4294967295 146880675765702656%   /Volumes/GoogleDrive
drivefs         30Gi  7.0Gi   23Gi    24% 18446744069414740697 4294967295 11796403584574832%   /Volumes/GoogleDrive
[Movies] dust
  46G ─┬ .
  24G  ├─┬ Star.Wars.The.Clone.Wars.S01.1080p.BluRay.x264-FLHD[rartv]
 1.1G  │ ├── Star.Wars.The.Clone.Wars.S01E22.1080p.BluRay.x264-FLHD.mkv
 1.1G  │ ├── Star.Wars.The.Clone.Wars.S01E20.1080p.BluRay.x264-FLHD.mkv
 1.1G  │ └── Star.Wars.The.Clone.Wars.S01E16.1080p.BluRay.x264-FLHD.mkv
 2.9G  ├─┬ Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT
 2.9G  │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
 2.9G  ├─┬ clone1
 2.9G  │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
 2.9G  ├─┬ clone2
 2.9G  │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
 2.9G  ├─┬ clone3
 2.9G  │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
 2.9G  ├─┬ clone4
 2.9G  │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
 2.9G  ├─┬ clone5
 2.9G  │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
 2.9G  ├─┬ clone6
 2.9G  │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
 1.4G  └─┬ Zero.Dark.Thirty.2012.1080p.BluRay.x265.10bit-z97
 1.4G    └── Zero.Dark.Thirty.2012.1080p.BluRay.x265.10bit-z97.mkv

I had never developed in Rust until 10 minutes ago, when I tried to see whether the metadata and file type structures could help us here. It seems not.

Actually, I have no clue how to tell whether a file is a clone or not: (https://stackoverflow.com/questions/46417747/apple-file-system-apfs-check-if-file-is-a-clone-on-terminal-shell)

I would be glad to help if you don't have a Mac to try things out on, but I don't think I would be able to submit a PR.

Best Regards,

Babwin avatar Mar 20 '20 21:03 Babwin

Hey,

Thanks, nice find.

From reading about Apple's copy, it appears that the equivalent on Linux is cp --reflink=always. Sadly my Linux install doesn't support this :-(. I do not have OS X.

From man cp: When --reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails

Can you please run 'ls -li' on the directory with the original and cloned objects? I am curious to know whether the inodes (the first number on each line) are different for the cloned objects.

bootandy avatar Mar 21 '20 19:03 bootandy

Might be able to fix this issue using this: http://m4rw3r.github.io/rust/std/os/macos/fs/trait.MetadataExt.html

bootandy avatar Mar 21 '20 19:03 bootandy

Here is the output of ls -li (inside the Movies directory, the source directory, and one of the target directories):

[Movies] ls -li
total 0
19238843 drwxr-xr-x   4 antoine  staff   128 Mar 11 23:14 Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT
 8436613 drwxr-xr-x  27 antoine  staff   864 Jan  6 23:40 Star.Wars.The.Clone.Wars.S01.1080p.BluRay.x264-FLHD[rartv]
14311823 drwxr-xr-x   5 antoine  staff   160 Mar  9 11:21 TV
 7932834 drwx------   4 antoine  staff   128 Dec 20  2017 Zero.Dark.Thirty.2012.1080p.BluRay.x265.10bit-z97
21151541 drwxr-xr-x   4 antoine  staff   128 Mar 22 14:07 clone1
21151546 drwxr-xr-x   4 antoine  staff   128 Mar 22 14:07 clone2
21151556 drwxr-xr-x   4 antoine  staff   128 Mar 22 14:07 clone3
21151566 drwxr-xr-x   4 antoine  staff   128 Mar 22 14:07 clone4
21151572 drwxr-xr-x   4 antoine  staff   128 Mar 22 14:07 clone5
  603418 drwxr-xr-x  76 antoine  staff  2432 Mar 18 18:07 funnyWEBM
[Movies] ls -li Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT
total 6104072
19242305 -rw-r--r--  1 antoine  staff          31 Mar 11 23:14 RARBG.txt
19238844 -rw-r--r--@ 1 antoine  staff  3123007576 Mar 11 23:22 Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
[Movies] ls -li clone1
total 6104072
21151542 -rw-r--r--  1 antoine  staff          31 Mar 11 23:14 RARBG.txt
21151543 -rw-r--r--@ 1 antoine  staff  3123007576 Mar 11 23:22 Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi

I don't think it's exactly the "same", since it's based on a feature of the Apple File System, but it's probably the same idea.

Here is an extract from the man page on my Mac OS X machine. I can't find it anywhere online.

-c    copy files using clonefile(2)

I will check MetadataExt.

antoineVerlant avatar Mar 22 '20 13:03 antoineVerlant

Those files have different inodes, so I assume they are different files. There might be something we can plug into in the macOS-specific MetadataExt: http://m4rw3r.github.io/rust/std/os/macos/fs/trait.MetadataExt.html in order to detect this, but whoever fixes this will likely need a Mac.

bootandy avatar Mar 22 '20 15:03 bootandy

As I said, I'm not used to Rust at all.

I don't really understand how to use a trait:

#![allow(unused)]
fn main() -> std::io::Result<()> {
    use std::os::macos::fs::MetadataExt;
    use std::fs;

    let metadata = fs::metadata("/Users/antoine/Movies/clone1")?;


    impl MetadataExt for Metadata;


    println!("{:?}", metadata);


    Ok(())

}

What am I supposed to do with this?

antoineVerlant avatar Mar 22 '20 20:03 antoineVerlant

remove this: impl MetadataExt for Metadata;

add this: println!("{:?}", metadata.as_raw_stat());

and print that data for a regular file, a cloned file, and a different file and see if you can work out what the difference is between them.

This might be a bit beyond you if you aren't a rust user so don't worry too much.

bootandy avatar Mar 22 '20 21:03 bootandy

I have this while compiling:

 --> test.rs:8:22
  |
8 |     println!("{:?}", metadata.as_raw_stat());
  |                      ^^^^^^^^^^^^^^^^^^^^^^ `std::os::macos::raw::stat` cannot be formatted using `{:?}` because it doesn't implement `std::fmt::Debug`
  |
  = help: the trait `std::fmt::Debug` is not implemented for `std::os::macos::raw::stat`
  = note: required because of the requirements on the impl of `std::fmt::Debug` for `&std::os::macos::raw::stat`
  = note: required by `std::fmt::Debug::fmt`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0277`.

I tried not formatting the string, but then I got this one.

 --> test.rs:8:22
  |
8 |     println(metadata.as_raw_stat());
  |                      ^^^^^^^^^^^
  |
  = note: `#[warn(deprecated)]` on by default

error: aborting due to previous error

For more information about this error, try `rustc --explain E0423`.

antoineVerlant avatar Mar 22 '20 23:03 antoineVerlant

OK, so it doesn't implement Debug, so you'd have to print each of the fields out manually.
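
A minimal sketch of printing the fields one by one. Note the swap: instead of the macOS-only as_raw_stat() (which triggers deprecation warnings), this uses the accessor methods of std::os::unix::fs::MetadataExt, which exist on both macOS and Linux. The describe helper is a name made up for this example, and it only covers the fields most likely to matter here.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // portable Unix accessors, not the macOS raw stat

/// Hypothetical helper: format the stat-like fields of `path`, one per line.
fn describe(path: &str) -> io::Result<String> {
    let m = fs::metadata(path)?;
    let lines = [
        format!("st_dev: {}", m.dev()),
        format!("st_ino: {}", m.ino()),
        format!("st_mode: {:o}", m.mode()), // printed in octal
        format!("st_nlink: {}", m.nlink()),
        format!("st_uid: {}", m.uid()),
        format!("st_gid: {}", m.gid()),
        format!("st_size: {}", m.size()),
        format!("st_blocks: {}", m.blocks()),
        format!("st_blksize: {}", m.blksize()),
    ];
    Ok(lines.join("\n"))
}

fn main() -> io::Result<()> {
    // Pass paths on the command line, e.g. an original file and its clone,
    // then compare the two printouts by eye.
    for path in std::env::args().skip(1) {
        println!("== {}\n{}", path, describe(&path)?);
    }
    Ok(())
}
```

Running it on a regular file, a cloned file, and an unrelated file gives three printouts to compare side by side.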

bootandy avatar Mar 24 '20 22:03 bootandy

I don't know if this helps:

[test] ./test
st_dev :16777221
st_uid :502
st_mode :16877
st_nlink:4
st_ino:21151541
st_uid: 502
st_gid: 20
st_rdev:0
st_atime:1584882512
st_atime_nsec:325028357
st_mtime:1584882447
st_mtime_nsec:667178615
st_ctime:1584882447
st_ctime_nsec:667178615
st_birthtime:1584882447
st_birthtime_nsec:664890469
st_size:128
st_blocks:0
st_blksize:4096
st_flags:0
st_gen:0
st_lspare:0
st_qspare:[0,0]

But I get this warning while compiling (for each field):

warning: use of deprecated item 'std::os::macos::raw::stat::st_qspare': these type aliases are no longer supported by the standard library, the `libc` crate on crates.io should be used instead for the correct definitions

I will try to look into it more deeply tonight.

Babwin avatar Mar 25 '20 09:03 Babwin

I'll close this issue unless I hear anything in the next few days.

bootandy avatar Aug 23 '22 08:08 bootandy

It looks like the raw metadata structure does not provide info to detect cloned files on macOS.

I'm not sure what levels of cloning are supported on APFS, but on Linux XFS & BTRFS, cloning is a block- or extent-level operation, not a file-level operation; if part of a file has been modified, then some blocks may be cloned but others not. Fully detecting clones, if it is even possible, is likely to require scanning a structure listing the blocks in the file. From clonefile(2) on my mac, it looks like APFS probably works similarly.

It would be very helpful for dust to have better support for these files, but I expect doing so is rather difficult. This blog post discusses one adventure in trying to detect clones; looks like you can detect that a file may have been cloned, but not necessarily that it is cloned.

mdekstrand avatar Sep 09 '22 15:09 mdekstrand

Thanks for the information @mdekstrand it is interesting.

Sadly, I think this is going to be too tricky to solve.

bootandy avatar Sep 11 '22 07:09 bootandy

I'm going to leave this here, in case someone comes along and does want to try to work on this issue: on Linux, it looks like the FIEMAP ioctl is the way to obtain the detailed extent data needed to detect sharing.

mdekstrand avatar Sep 12 '22 15:09 mdekstrand

Continuing the conversation, since I think it's interesting: don't you think it should be the responsibility of the file system to return the "real size" of a file, rather than each app having to "manage" each file system?

antoineVerlant avatar Sep 13 '22 11:09 antoineVerlant

@antoineVerlant I don't think that would make sense for this problem. In the case of cloned extents, each file is its real size — stat returns a size, and that is the size of the file. The problem is that the two files together take less space than it looks like by adding their sizes. But only the user-space program knows what files are in the set it is considering. If you run dust on a directory, and some of the files are clones of files in other directories not included in the dust run, then their full sizes should be reported; only when there are clones within the set of files counted in a dust run should dust consider accounting for the shared space. The operating system has no way to account for that, unless it is augmented with complex system calls to obtain detailed space usage across directory trees.
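
To make the accounting concrete, here is a rough sketch of what "counting shared space once within the set" could look like, assuming the extent lists are already available. The Extent type and unique_bytes function are hypothetical names for illustration; obtaining real extent data from the filesystem is the hard part and is not shown.

```rust
/// Hypothetical physical extent: starting byte offset on disk and length in bytes.
/// A real tool would have to fill this in from the filesystem.
#[derive(Clone, Copy, Debug)]
struct Extent {
    physical: u64,
    length: u64,
}

/// Bytes occupied by a set of files, counting overlapping physical ranges once.
fn unique_bytes(files: &[Vec<Extent>]) -> u64 {
    // Pool the extents of every file in the set and sort by disk position.
    let mut all: Vec<Extent> = files.iter().flatten().copied().collect();
    all.sort_by_key(|e| e.physical);

    let mut total = 0u64;
    let mut covered_to = 0u64; // end of the disk range already counted
    for e in all {
        let end = e.physical + e.length;
        if end > covered_to {
            // Only the part beyond what is already counted is new space.
            total += end - e.physical.max(covered_to);
            covered_to = end;
        }
    }
    total
}

fn main() {
    // An original file, and a clone whose last 20 bytes were rewritten elsewhere.
    let files = vec![
        vec![Extent { physical: 0, length: 100 }],
        vec![
            Extent { physical: 0, length: 80 },
            Extent { physical: 500, length: 20 },
        ],
    ];
    let naive: u64 = files.iter().flatten().map(|e| e.length).sum();
    println!("naive sum: {}, unique: {}", naive, unique_bytes(&files));
}
```

The result depends entirely on which files are passed in: a clone whose sibling is outside the set still counts at full size, which is why the user-space program has to do this accounting.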

Other seemingly-related problems are much easier to handle. Hard-links can be detected by comparing inodes, and only counting a file the first time its inode is seen. Sparse files have their actual space used reported by the operating system. It's just clones that are the tricky problem (at least among the kinds of problems a program like dust is likely to encounter).
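
As an illustration of those two easier cases, a minimal sketch (not dust's actual implementation) might look like the following. The disk_usage function is a hypothetical name; it relies on st_blocks being reported in 512-byte units, and it keys on the (device, inode) pair, since inode numbers are only unique within one filesystem.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Hypothetical helper: sum the on-disk usage of `paths`,
/// counting each (device, inode) pair only the first time it is seen.
fn disk_usage<P: AsRef<Path>>(paths: &[P]) -> io::Result<u64> {
    let mut seen = HashSet::new();
    let mut total = 0u64;
    for p in paths {
        let m = fs::symlink_metadata(p)?;
        // insert() returns false if this inode was already counted (a hard link).
        if seen.insert((m.dev(), m.ino())) {
            // st_blocks is in 512-byte units, so a sparse file contributes
            // only the space actually allocated to it.
            total += m.blocks() * 512;
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let paths: Vec<String> = std::env::args().skip(1).collect();
    println!("{} bytes on disk", disk_usage(&paths)?);
    Ok(())
}
```

Clones defeat this approach because, as the ls -li output earlier in the thread shows, each clone has its own inode.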

mdekstrand avatar Sep 13 '22 15:09 mdekstrand