diskus
Consider adding support for Windows directory size "philosophy"
Hey, first time trying diskus, and while I can confirm that it's really fast here as well, I get a different result in total bytes on a local directory. 😕
I've noticed the Windows caveat section, but I don't think it applies in my case, because there is nothing unusual in this path: no junctions or hardlinks whatsoever.
To be sure, I've tested the same path with some other tools: the Python-based duu^1, another one implemented in Rust found here on GitHub (dua^2), and the Sysinternals Disk Usage (du^3) tool for Windows for reference.
diskus
PS E:\> diskus Down
101.90 GB (101,896,640,783 bytes)
PS E:\> cd Down
PS E:\Down> diskus .
101.90 GB (101,896,640,783 bytes)
PS E:\Down>
Here's the comparison:
duu
PS E:\Down> duu --quiet
summary
=======
files : 3'919
directories : 125
bytes : 101'895'657'743
kilobytes : 99'507'478.26
megabytes : 97'175.27
gigabytes : 94.90
PS E:\Down>
dua
PS E:\Down> dua --format bytes '.'
101895657743 b . entries
PS E:\Down> '{0:N0}' -f 101895657743
101'895'657'743
PS E:\Down>
du
PS E:\Down> du E:\Down
Files: 3919
Directories: 125
Size: 101'895'657'743 bytes
Size on disk: 101'904'261'120 bytes
PS E:\Down>
And last but not least, the total value in bytes as displayed in Windows Explorer: 94.8 GB (101'895'657'743 bytes)
OS information:
PS E:\Down> [Environment]::OSVersion.VersionString
Microsoft Windows NT 10.0.19043.0
PS E:\Down> Get-WindowsVersion | select Version, OS* | fl
Version : 2009
OS Build : 19043.1526
PS E:\Down>
Interesting! Thank you for reporting this.
because there is nothing unusual in this path, no junctions, or hardlinks whatsoever.
Are you sure? It looks like dua-cli at least has some special handling for hard links. That could mean that diskus is double-counting one (or multiple) files. Maybe you can narrow it down by running the comparison on subfolders? Or try to search for a file with a size of 101,896,640,783 - 101,895,657,743 bytes.
Yes, I'm sure, only files and directories.
But there's something else I noticed:
PS E:\> 101896640783 - 101895657743
983040
PS E:\> 983040/4096
240
PS E:\>
4096 bytes is the standard cluster size / default allocation unit size for NTFS (except for very large partitions).
I tested this with another directory (small, just 10 files), and I could observe the same phenomenon:
PS E:\> du Etc
Files: 10
Directories: 2
Size: 9'913 bytes
Size on disk: 86'016 bytes
PS E:\> diskus Etc
18.11 KB (18,105 bytes)
PS E:\> 18105-9913
8192
PS E:\> ls -force Etc
Directory: E:\Etc
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 31.12.2021 21:08 Menustore
PS E:\> du Etc\Menustore
Files: 10
Directories: 1
Size: 9'913 bytes
Size on disk: 81'920 bytes
PS E:\> diskus Etc\Menustore
14.01 KB (14,009 bytes)
PS E:\> 14009-9913
4096
PS E:\> 4096*2
8192
The first test is against the parent directory (Etc), the second against the contained subdirectory (Menustore), which contains the 10 files.
So, in sum, as also printed with du, the former is 2 directories and 10 files, the latter is 1 directory and 10 files. In the first case, the reported difference by diskus is 8192 bytes, and the second difference is 4096 bytes. So, it seems to me that diskus is counting one cluster unit size per directory on top?
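The arithmetic behind that guess can be checked with a small sketch. The numbers are copied from the transcripts above; the 4096-byte cluster size is the NTFS default mentioned earlier, and the helper name `surplus_clusters` is made up for illustration.

```python
CLUSTER = 4096  # default NTFS allocation unit size, as stated above

# (label, bytes reported by diskus, bytes reported by Sysinternals du,
#  number of directories reported by du)
measurements = [
    ("Etc", 18_105, 9_913, 2),
    ("Etc\\Menustore", 14_009, 9_913, 1),
    ("Down", 101_896_640_783, 101_895_657_743, 125),
]


def surplus_clusters(diskus_bytes, du_bytes, cluster=CLUSTER):
    """Return (whole clusters, leftover bytes) of what diskus reports on top of du."""
    return divmod(diskus_bytes - du_bytes, cluster)


for label, diskus_bytes, du_bytes, dirs in measurements:
    clusters, leftover = surplus_clusters(diskus_bytes, du_bytes)
    print(f"{label}: {clusters} cluster(s), {leftover} leftover bytes, {dirs} directories")
```

In all three cases the surplus is a whole number of clusters. For the two small directories it matches the directory count exactly (2 and 1). For Down it is 240 clusters across 125 directories, so "one cluster per directory" can only be a first approximation there.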
Thank you for looking into this further. Your analysis is spot on! I looked at the source code of duu and dua-cli. As far as I can tell, both only count the size of all FILES (or non-directories):
- https://github.com/jftuga/duu/blob/90d1ce8cbcfb7acb6b3c333a21434b4db0b15961/duu.py#L427-L434
- https://github.com/Byron/dua-cli/blob/b029dc5d190b23bf3e3fc95a3947f28f868e674e/src/traverse.rs#L115
diskus, on the other hand, also adds the size of the directory entries themselves, including the root directory:
- https://github.com/sharkdp/diskus/blob/9641caa46755160f49e123c6d226aa46c9cdde34/src/walk.rs#L28
On Linux, this is consistent with what du does:
▶ mkdir test-directory
▶ touch test-directory/empty-file
▶ mkdir test-directory/empty-subdirectory
▶ echo -n "123" > test-directory/file-3-bytes
▶ echo -n "1234567" > test-directory/file-7-bytes
▶ du -s --block-size=1 test-directory; diskus test-directory
8192 test-directory
8.19 KB (8,192 bytes)
▶ du -s --apparent-size --block-size=1 test-directory; diskus --apparent-size test-directory
170 test-directory
170 B (170 bytes)
Compare this with duu.py, which only sees the 3-byte and 7-byte files:
▶ python duu.py --quiet test-directory
summary
=======
files : 3
directories : 2
bytes : 10
So I am not really sure how to proceed here. Apparently, the "real" Windows tools also seem to disregard the size of directories?!
See also: https://github.com/sharkdp/diskus/pull/49
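The two counting conventions can be put side by side in a rough sketch. This is not the code of diskus, duu, or dua-cli; the function name `apparent_sizes` is invented here, and it only looks at apparent sizes (`st_size`), not at allocated blocks.

```python
import os


def apparent_sizes(root):
    """Return (files_only, files_and_dirs) apparent sizes in bytes for the tree at root.

    files_only     sums st_size of non-directory entries (the duu / dua-cli style).
    files_and_dirs additionally sums st_size of every directory entry,
                   including the root itself (the du / diskus style).
    """
    files_only = 0
    dirs_total = os.lstat(root).st_size  # the root directory itself
    for dirpath, dirnames, filenames in os.walk(root):
        # Note: os.walk lists symlinks to directories under dirnames,
        # so this sketch counts them together with the directories.
        for name in dirnames:
            dirs_total += os.lstat(os.path.join(dirpath, name)).st_size
        for name in filenames:
            files_only += os.lstat(os.path.join(dirpath, name)).st_size
    return files_only, files_only + dirs_total
```

On the test-directory above, the first number would be 10. The second number depends on what the filesystem reports as the size of a directory entry, which is exactly where the tools disagree.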
Uh, well. I jotted down my reply too quickly and should have paused for a moment, because then I might have remembered that I tripped over this not that long ago.
So I am not really sure how to proceed here. Apparently, the "real" Windows tools also seem to disregard the size of directories?!
Yeah, that's exactly the issue. On Windows, by definition, a directory itself does not have a size. Or, phrased differently, the size of the directory alone is always zero...
It's a question of abstraction. Without doing a deep dive on file systems here, the gist of the issue is that, on a technical level, a directory obviously does have a size. But, due to how NTFS works, we have another famous case of my favorite problem class for the entirety of science (and engineering): counting, because counting is hard. Sesame Street didn't teach us that the real crux is what, where, and when to count...
In NTFS land, it's called the Master File Table (MFT), and this is where the space for the directories goes. NTFS has a reserved space for this, called the MFT Zone, I think, and while it's a part of the file system, it's not the, well, visible part of the file system.
To be honest, I don't know either what would be the best way to proceed here. You could expand the Windows caveat section, to mention that we're following Unix here and count like du does, or add special case handling for Windows. I don't know, it's your project, so it's up to you 😄
Edit: I think this issue needs a more appropriate name. I'll give it a shot.
Yeah, that's exactly the issue. On Windows, by definition, a directory itself does not have a size. Or, phrased differently, the size of the directory alone is always zero...
I think it is important to distinguish between "disk usage" and "apparent size" here. GNU du typically shows "disk usage" (as the name implies), but can be switched to compute "apparent sizes". Both du and diskus compute these two quantities based on st_size and st_blocks, two fields returned from a stat syscall:
struct stat {
    dev_t     st_dev;         /* ID of device containing file */
    ino_t     st_ino;         /* Inode number */
    mode_t    st_mode;        /* File type and mode */
    nlink_t   st_nlink;       /* Number of hard links */
    uid_t     st_uid;         /* User ID of owner */
    gid_t     st_gid;         /* Group ID of owner */
    dev_t     st_rdev;        /* Device ID (if special file) */
    off_t     st_size;        /* Total size, in bytes */
    blksize_t st_blksize;     /* Block size for filesystem I/O */
    blkcnt_t  st_blocks;      /* Number of 512B blocks allocated */

    /* Since Linux 2.6, the kernel supports nanosecond
       precision for the following timestamp fields.
       For the details before Linux 2.6, see NOTES. */

    struct timespec st_atim;  /* Time of last access */
    struct timespec st_mtim;  /* Time of last modification */
    struct timespec st_ctim;  /* Time of last status change */

#define st_atime st_atim.tv_sec      /* Backward compatibility */
#define st_mtime st_mtim.tv_sec
#define st_ctime st_ctim.tv_sec
};
st_size
This field gives the size of the file (if it is a regular file or a symbolic link) in bytes. The size of a symbolic link is the length of the pathname it contains, without a terminating null byte.
st_blocks
This field indicates the number of blocks allocated to the file, in 512-byte units. (This may be smaller than st_size/512 when the file has holes.)
where:
disk usage = 512 * st_blocks
apparent size = st_size
The following table shows the difference between these two quantities (on a filesystem with 4KiB block size):
| | Disk usage | Apparent size |
|---|---|---|
| empty directory | 0 B | 40 B |
| directory with empty subdirectory | 0 B | 100 B |
| file-0-bytes | 0 B | 0 B |
| file-3-bytes | 4096 B | 3 B |
| file-4096-bytes | 4096 B | 4096 B |
| file-4097-bytes | 8192 B | 4097 B |
| symlink to "file-3-bytes" | 0 B | 12 B |
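For a single path, the two quantities can be read off a stat result as in the sketch below. It is only an illustration of the two formulas above, not diskus code; the function name is made up, and the fallback for platforms without `st_blocks` (Python does not provide it on Windows) is an assumption of this sketch.

```python
import os


def disk_usage_and_apparent_size(path):
    """Return (disk usage, apparent size) in bytes for one path, not following symlinks.

    Disk usage is None on platforms where the stat result has no st_blocks field.
    """
    st = os.lstat(path)
    # st_blocks counts 512-byte units, independent of the filesystem block size.
    blocks = getattr(st, "st_blocks", None)
    disk_usage = None if blocks is None else 512 * blocks
    apparent_size = st.st_size
    return disk_usage, apparent_size
```

The concrete disk usage values depend on the filesystem, so the numbers in the table should not be expected everywhere.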
It looks to me like the number you are getting on Windows is neither disk usage, nor apparent size. It's the "apparent size of all files, excluding directories".
The Windows sysinternals du tool seems to report both "Size" and "Size on disk" (similar to what Windows Explorer does, I guess?). For some reason though, the "size on disk" number is even higher than what I would expect from a "disk usage" option. In my integration test, the output on the test directory, which contains a 0B, a 3B and a 7B file as well as an empty directory, is the following:
- diskus: 4106 (which is 4096+10)
- sysinternals du: Size: 10 bytes, Size on disk: 24,576 bytes
I have no idea where the latter number of "24,576 = 6 × 4,096" bytes comes from.
But I agree with you that something should be changed on Windows. Maybe we could introduce a --include-directory-size=never/always/auto option where auto would be the default and select always on Linux and never on Windows?
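How such an option might resolve could look roughly like this. The option is only a proposal at this point, so everything here is hypothetical: the function name, the use of `sys.platform`, and the choice to treat every non-Windows platform like Linux are assumptions of this sketch.

```python
import sys


def include_directory_size(mode="auto", platform=None):
    """Decide whether directory sizes are added to the total.

    mode     one of "never", "always", "auto" (the proposed option values)
    platform a sys.platform-style string; defaults to the current platform
    """
    if platform is None:
        platform = sys.platform
    if mode == "always":
        return True
    if mode == "never":
        return False
    if mode == "auto":
        # Assumption: "auto" means "never" on Windows and "always" elsewhere.
        return not platform.startswith("win")
    raise ValueError(f"unknown mode: {mode!r}")
```

With this reading, `include_directory_size("auto", "win32")` is False and `include_directory_size("auto", "linux")` is True, while an explicit "always" or "never" overrides the platform.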
The Windows sysinternals du tool seems to report both "Size" and "Size on disk" (similar to what Windows Explorer does, I guess?).
Yes, this is true, du.exe always shows the same values as Windows does in the File Explorer, for example. Both "Size" and "Size on Disk".
For some reason though, the "size on disk" number is even higher than what I would expect from a "disk usage" option. In my integration test, the output on the test directory which contains a 0B, a 3B and a 7B file as well as an empty directory, is the following:
- diskus: 4106 (which is 4096+10)
- sysinternals du: Size: 10 bytes, Size on disk: 24,576 bytes
I have no idea where the latter number of "24,576 = 6 × 4,096" bytes comes from.
Good question.
PS D:\Temp> ls -fo
PS D:\Temp> mkdir 'emptydir'
Directory: D:\Temp
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 19.02.2022 15:09 emptydir
PS D:\Temp> 'a' > 3by; 'aaaaa' > 7by
PS D:\Temp> new-item 0by
Directory: D:\Temp
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 19.02.2022 15:10 0 0by
PS D:\Temp> ls
Directory: D:\Temp
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 19.02.2022 15:09 emptydir
-a--- 19.02.2022 15:10 0 0by
-a--- 19.02.2022 15:10 3 3by
-a--- 19.02.2022 15:10 7 7by
PS D:\Temp> cd ..
PS D:\> du Temp
Files: 3
Directories: 2
Size: 10 bytes
Size on disk: 24'576 bytes
PS D:\>
I was thinking about the parent directory itself, maybe, but it doesn't really add up. An empty directory in an empty directory (not really empty anymore now) is as expected, and the parent directory does not count:
Files: 0
Directories: 2
Size: 0 bytes
Size on disk: 4'096 bytes
But I agree with you that something should be changed on Windows. Maybe we could introduce a --include-directory-size=never/always/auto option where auto would be the default and select always on Linux and never on Windows?
Sounds good to me!
Windows 7 x64, Diskus 0.7.0
Let's get the folder size in bytes.
$ for /f "tokens=1,2 delims=: " %a in ('robocopy C:\Test . /L /BYTES /S /NJH /NDL /NFL /XJ /R:0 /W:0') do @if /i %a==Bytes echo %b
3012235061
$ pwsh -c "(gci -lp C:\Test -r -force | measure -p length -sum).sum"
3012235061
$ coreutils du -bs C:\Test
3012235061 C:\Test
$ duu -q C:\Test
summary
=======
files : 60 531
directories : 7 136
bytes : 3 012 235 061
kilobytes : 2 941 635,80
megabytes : 2 872,69
gigabytes : 2,81
So far so good, the output matches. But what does Diskus show?
$ diskus.exe --size-format decimal C:\Test
3.04 GB (3,036,213,045 bytes)
$ diskus.exe --size-format binary C:\Test
2.83 GiB (3,036,213,045 bytes)
3 012 235 061 vs 3 036 213 045.
Something is wrong, indeed.
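For what it's worth, the gap between the two totals can be broken down in a few lines. The byte counts and the directory count are taken from the output above; the 4096-byte cluster size for this volume is an assumption, since it is not shown here.

```python
diskus_bytes = 3_036_213_045
other_tools_bytes = 3_012_235_061  # robocopy, pwsh, coreutils du, and duu agree on this
directories = 7_136                # as reported by duu
cluster = 4096                     # assumed NTFS cluster size for C:

surplus = diskus_bytes - other_tools_bytes
clusters, leftover = divmod(surplus, cluster)
print(f"surplus: {surplus} bytes = {clusters} clusters + {leftover} bytes")
print(f"directories: {directories}")
```

The surplus is 23,977,984 bytes, which is exactly 5,854 clusters of 4096 bytes with nothing left over. That looks like whole allocation units being added on top of the file sizes rather than a random discrepancy, though the cluster count does not equal the number of directories.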
Err… Hello?
Yes? Did you read the entire issue here? It should explain everything.