dust
Clones should probably be ignored / have their size computed differently
Hello,
First, thanks for this amazing tool. Thanks to it I was able to clean a lot of bull*** off my computer.
I'm opening this issue on a subject I don't really understand, but I hope it helps.
Thanks to the BSD/Apple clone feature, I can clone a directory or file with cp -c.
Clones are, as I understand it, a kind of weird hard link, but when you write over one, it saves only the diff.
In my example, you can see that I cloned the dankest movie in my library a few times and df -h doesn't report a disk usage difference.
dust reports the clones as if each one takes up additional space on disk.
[Movies] df -h
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1s1 466Gi 10Gi 90Gi 11% 484283 4881968597 0% /
devfs 403Ki 403Ki 0Bi 100% 1404 0 100% /dev
/dev/disk1s2 466Gi 352Gi 90Gi 80% 3172969 4879279911 0% /System/Volumes/Data
/dev/disk1s5 466Gi 12Gi 90Gi 12% 12 4882452868 0% /private/var/vm
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /System/Volumes/Data/home
/dev/disk2s2 105Mi 105Mi 0Bi 100% 3 4294967276 0% /Volumes/Install Google Drive File Stream
drivefs 30Gi 7.0Gi 23Gi 24% 18446744069414596880 4294967295 146880675765702656% /Volumes/GoogleDrive
drivefs 30Gi 7.0Gi 23Gi 24% 18446744069414740697 4294967295 11796403584574832% /Volumes/GoogleDrive
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone1
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone2
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone3
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone4
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone5
[Movies] cp -cR Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT clone6
[Movies] df -h
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1s1 466Gi 10Gi 90Gi 11% 484283 4881968597 0% /
devfs 403Ki 403Ki 0Bi 100% 1404 0 100% /dev
/dev/disk1s2 466Gi 352Gi 90Gi 80% 3172987 4879279893 0% /System/Volumes/Data
/dev/disk1s5 466Gi 12Gi 90Gi 12% 12 4882452868 0% /private/var/vm
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /System/Volumes/Data/home
/dev/disk2s2 105Mi 105Mi 0Bi 100% 3 4294967276 0% /Volumes/Install Google Drive File Stream
drivefs 30Gi 7.0Gi 23Gi 24% 18446744069414596880 4294967295 146880675765702656% /Volumes/GoogleDrive
drivefs 30Gi 7.0Gi 23Gi 24% 18446744069414740697 4294967295 11796403584574832% /Volumes/GoogleDrive
[Movies] dust
46G ─┬ .
24G ├─┬ Star.Wars.The.Clone.Wars.S01.1080p.BluRay.x264-FLHD[rartv]
1.1G │ ├── Star.Wars.The.Clone.Wars.S01E22.1080p.BluRay.x264-FLHD.mkv
1.1G │ ├── Star.Wars.The.Clone.Wars.S01E20.1080p.BluRay.x264-FLHD.mkv
1.1G │ └── Star.Wars.The.Clone.Wars.S01E16.1080p.BluRay.x264-FLHD.mkv
2.9G ├─┬ Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT
2.9G │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
2.9G ├─┬ clone1
2.9G │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
2.9G ├─┬ clone2
2.9G │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
2.9G ├─┬ clone3
2.9G │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
2.9G ├─┬ clone4
2.9G │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
2.9G ├─┬ clone5
2.9G │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
2.9G ├─┬ clone6
2.9G │ └── Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
1.4G └─┬ Zero.Dark.Thirty.2012.1080p.BluRay.x265.10bit-z97
1.4G └── Zero.Dark.Thirty.2012.1080p.BluRay.x265.10bit-z97.mkv
I had never developed in Rust until 10 minutes ago, when I tried to see whether the metadata and file type structures could help us here. It seems not.
Actually, I have no idea how to tell whether a file is a clone or not: (https://stackoverflow.com/questions/46417747/apple-file-system-apfs-check-if-file-is-a-clone-on-terminal-shell)
I would be glad to help if you don't have macOS to try things out on, but I don't think I would be able to submit a PR.
Best Regards,
Hey,
Thanks, nice find.
From reading about Apple's cp -c, it appears that the equivalent is cp --reflink=always on Linux. Sadly my Linux install doesn't support this :-(. I do not have OS X.
From man cp:
When --reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails
Can you please run 'ls -li' on the directory with the original and cloned object? I am curious to know whether the inodes (the number in the first column) are different on the cloned object.
Might be able to fix this issue using this: http://m4rw3r.github.io/rust/std/os/macos/fs/trait.MetadataExt.html
Here is the result of ls -li (inside the Movies directory, the source directory, and one of the target directories):
[Movies] ls -li
total 0
19238843 drwxr-xr-x 4 antoine staff 128 Mar 11 23:14 Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT
8436613 drwxr-xr-x 27 antoine staff 864 Jan 6 23:40 Star.Wars.The.Clone.Wars.S01.1080p.BluRay.x264-FLHD[rartv]
14311823 drwxr-xr-x 5 antoine staff 160 Mar 9 11:21 TV
7932834 drwx------ 4 antoine staff 128 Dec 20 2017 Zero.Dark.Thirty.2012.1080p.BluRay.x265.10bit-z97
21151541 drwxr-xr-x 4 antoine staff 128 Mar 22 14:07 clone1
21151546 drwxr-xr-x 4 antoine staff 128 Mar 22 14:07 clone2
21151556 drwxr-xr-x 4 antoine staff 128 Mar 22 14:07 clone3
21151566 drwxr-xr-x 4 antoine staff 128 Mar 22 14:07 clone4
21151572 drwxr-xr-x 4 antoine staff 128 Mar 22 14:07 clone5
603418 drwxr-xr-x 76 antoine staff 2432 Mar 18 18:07 funnyWEBM
[Movies] ls -li Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT
total 6104072
19242305 -rw-r--r-- 1 antoine staff 31 Mar 11 23:14 RARBG.txt
19238844 -rw-r--r--@ 1 antoine staff 3123007576 Mar 11 23:22 Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
[Movies] ls -li clone1
total 6104072
21151542 -rw-r--r-- 1 antoine staff 31 Mar 11 23:14 RARBG.txt
21151543 -rw-r--r--@ 1 antoine staff 3123007576 Mar 11 23:22 Sonic.the.Hedgehog.2020.720p.HDRip.XviD.MP3-STUTTERSHIT.avi
I don't think it's exactly the "same", since it's based on an Apple File System feature, but it's probably the same idea.
Here is an extract from the man page on my Mac OS X; I can't find it anywhere online.
-c copy files using clonefile(2)
Will check the MetadataExt
Those files have different inodes, so I assume they are different files. There might be something we can plug into in the macOS-specific MetadataExt: http://m4rw3r.github.io/rust/std/os/macos/fs/trait.MetadataExt.html in order to detect this, but whoever fixes this will likely need a Mac.
As I said, I'm not used to Rust at all.
I don't really understand how to use a trait:
#![allow(unused)]
fn main() -> std::io::Result<()> {
use std::os::macos::fs::MetadataExt;
use std::fs;
let metadata = fs::metadata("/Users/antoine/Movies/clone1")?;
impl MetadataExt for Metadata;
println!("{:?}", metadata);
Ok(())
}
What am I doing wrong with this?
remove this: impl MetadataExt for Metadata;
add this:
println!("{:?}", metadata.as_raw_stat());
and print that data for a regular file, a cloned file, and a different file and see if you can work out what the difference is between them.
This might be a bit beyond you if you aren't a rust user so don't worry too much.
I have this while compiling:
--> test.rs:8:22
|
8 | println!("{:?}", metadata.as_raw_stat());
| ^^^^^^^^^^^^^^^^^^^^^^ `std::os::macos::raw::stat` cannot be formatted using `{:?}` because it doesn't implement `std::fmt::Debug`
|
= help: the trait `std::fmt::Debug` is not implemented for `std::os::macos::raw::stat`
= note: required because of the requirements on the impl of `std::fmt::Debug` for `&std::os::macos::raw::stat`
= note: required by `std::fmt::Debug::fmt`
error: aborting due to previous error
For more information about this error, try `rustc --explain E0277`.
I tried not formatting the string, but then I got this one:
--> test.rs:8:22
|
8 | println(metadata.as_raw_stat());
| ^^^^^^^^^^^
|
= note: `#[warn(deprecated)]` on by default
error: aborting due to previous error
For more information about this error, try `rustc --explain E0423`.
OK, so it doesn't implement Debug, which means you'd have to print each of the fields out manually.
I don't know if this helps:
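For anyone repeating this experiment, here is a minimal sketch of printing the fields one by one. It is an assumption on my part that the portable accessors from std::os::unix::fs::MetadataExt are acceptable here instead of as_raw_stat(); they expose the same stat values and do not depend on the deprecated raw struct. The describe helper is a hypothetical name, not something from dust.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // accessor methods such as ino() and blocks()
use std::path::Path;

/// Format the stat fields most relevant to this issue, one per line.
/// (Hypothetical helper for this experiment, not part of dust.)
fn describe(path: &Path) -> io::Result<String> {
    let md = fs::metadata(path)?;
    Ok(format!(
        "st_dev: {}\nst_ino: {}\nst_nlink: {}\nst_size: {}\nst_blocks: {}\nst_blksize: {}",
        md.dev(),
        md.ino(),
        md.nlink(),
        md.size(),
        md.blocks(),
        md.blksize()
    ))
}

fn main() -> io::Result<()> {
    // Pass the paths to compare, e.g. the original file and one of its clones.
    for arg in std::env::args().skip(1) {
        println!("{}\n{}\n", arg, describe(Path::new(&arg))?);
    }
    Ok(())
}
```

Running it on the original .avi, on a clone, and on an unrelated file gives three listings that can be compared side by side.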
[test] ./test
st_dev :16777221
st_uid :502
st_mode :16877
st_nlink:4
st_ino:21151541
st_uid: 502
st_gid: 20
st_rdev:0
st_atime:1584882512
st_atime_nsec:325028357
st_mtime:1584882447
st_mtime_nsec:667178615
st_ctime:1584882447
st_ctime_nsec:667178615
st_birthtime:1584882447
st_birthtime_nsec:664890469
st_size:128
st_blocks:0
st_blksize:4096
st_flags:0
st_gen:0
st_lspare:0
st_qspare:[0,0]
But I get this warning while compiling (for each field):
warning: use of deprecated item 'std::os::macos::raw::stat::st_qspare': these type aliases are no longer supported by the standard library, the `libc` crate on crates.io should be used instead for the correct definitions
I will try to dig deeper tonight.
I'll close this issue unless I hear anything in the next few days.
It looks like the raw metadata structure does not provide info to detect cloned files on macOS.
I'm not sure what levels of cloning are supported on APFS, but on Linux XFS & BTRFS, cloning is a block- or extent-level operation, not a file-level operation; if part of a file has been modified, then some blocks may be cloned but others not. Fully detecting clones, if it is even possible, is likely to require scanning a structure listing the blocks in the file. From clonefile(2) on my Mac, it looks like APFS probably works similarly.
It would be very helpful for dust to have better support for these files, but I expect doing so is rather difficult. This blog post discusses one adventure in trying to detect clones; looks like you can detect that a file may have been cloned, but not necessarily that it is cloned.
Thanks for the information @mdekstrand it is interesting.
Sadly, I think this is going to be too tricky to solve.
I'm going to leave this here, in case someone comes along and does want to try to work on this issue: on Linux, it looks like the FIEMAP ioctl is the way to obtain the detailed extent data needed to detect sharing.
I'd like to continue the conversation, since I think it's interesting. Don't you think it should be the responsibility of the file system to return the "real size" of a file, rather than each app having to "manage" each file system?
@antoineVerlant I don't think that would make sense for this problem. In the case of cloned extents, each file is its real size: stat returns a size, and that is the size of the file. The problem is that the two files together take less space than it looks like by adding their sizes. But only the user-space program knows what files are in the set it is considering. If you run dust on a directory, and some of the files are clones of files in other directories not included in the dust run, then their full sizes should be reported; only when there are clones within the set of files counted in a dust run should dust consider accounting for the shared space. The operating system has no way to account for that, unless it is augmented with complex system calls to obtain detailed space usage across directory trees.
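To make the set-dependence concrete, here is a small sketch of the arithmetic. Everything in it is an assumption for illustration: the Extent type, the block numbers, and both function names are hypothetical, and actually obtaining extent lists from the file system is the hard part that this sketch skips.

```rust
/// Hypothetical description of one physical extent: (start_block, block_count).
type Extent = (u64, u64);

/// Sum of extent lengths per file, which is what adding up file sizes amounts to.
fn naive_blocks(files: &[Vec<Extent>]) -> u64 {
    files.iter().flatten().map(|&(_, len)| len).sum()
}

/// Blocks covered by the union of all extents in the set, counting shared blocks once.
fn shared_aware_blocks(files: &[Vec<Extent>]) -> u64 {
    let mut all: Vec<Extent> = files.iter().flatten().copied().collect();
    all.sort(); // order by start block so overlaps are adjacent
    let mut total = 0;
    let mut covered_to = 0u64; // end of the block range already counted
    for (start, len) in all {
        let end = start + len;
        if end > covered_to {
            total += end - start.max(covered_to);
            covered_to = end;
        }
    }
    total
}

fn main() {
    // Made-up layout: the clone shares 40 blocks with the original and has
    // 10 blocks of its own that were rewritten after cloning.
    let original = vec![(100, 50)];
    let clone = vec![(100, 40), (500, 10)];
    let unrelated = vec![(900, 20)];

    let both = vec![original.clone(), clone.clone(), unrelated];
    println!("naive: {} blocks", naive_blocks(&both));
    println!("shared-aware: {} blocks", shared_aware_blocks(&both));

    // With only the clone in the set, nothing is shared and both totals agree.
    let only_clone = vec![clone];
    println!("clone alone: {} blocks", shared_aware_blocks(&only_clone));
}
```

The same clone contributes 50 blocks when counted alone but only 10 extra blocks when the original is also in the set, which is why the answer depends on what the user-space program is looking at.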
Other seemingly-related problems are much easier to handle. Hard-links can be detected by comparing inodes, and only counting a file the first time its inode is seen. Sparse files have their actual space used reported by the operating system. It's just clones that are the tricky problem (at least among the kinds of problems a program like dust is likely to encounter).
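As a rough sketch of those two easier cases (this is not dust's actual implementation, and disk_usage is a hypothetical name): hard links are skipped by remembering each (device, inode) pair, and sparse files are handled by summing st_blocks rather than st_size. It assumes a Unix-like system where st_blocks is reported in 512-byte units.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Sum allocated bytes for everything under `dir`, counting each
/// (device, inode) pair only the first time it is seen.
fn disk_usage(dir: &Path, seen: &mut HashSet<(u64, u64)>) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let md = fs::symlink_metadata(&path)?; // do not follow symlinks
        if !seen.insert((md.dev(), md.ino())) {
            continue; // hard link to an inode that was already counted
        }
        // Allocated blocks, not logical size, so holes in sparse files are not counted.
        total += md.blocks() * 512;
        if md.is_dir() {
            total += disk_usage(&path, seen)?;
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // Directory to scan; defaults to the current directory.
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    let mut seen = HashSet::new();
    let bytes = disk_usage(Path::new(&root), &mut seen)?;
    println!("{} bytes allocated across {} inodes", bytes, seen.len());
    Ok(())
}
```

Clones defeat this approach because, as the ls -li output earlier in the thread shows, each clone has its own inode and its own block count.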