
Consider adding support for Windows directory size "philosophy"

Open Hrxn opened this issue 2 years ago • 10 comments

Hey, first time trying diskus, and while I can confirm that it's really fast here as well, I get a different result in total bytes on a local directory. 😕

I've noticed the Windows caveat section, but I don't think that this applies in my case here, because there is nothing unusual in this path, no junctions, or hardlinks whatsoever.

To be sure, I've tested the same path with some other tools: the Python-based duu^1, another one implemented in Rust found here on GitHub (dua^2), and the Sysinternals Disk Usage (du^3) tool for Windows for reference.

diskus
PS E:\> diskus Down
101.90 GB (101,896,640,783 bytes)
PS E:\> cd Down
PS E:\Down> diskus .
101.90 GB (101,896,640,783 bytes)
PS E:\Down>

Here's the comparison:

duu
PS E:\Down> duu --quiet

summary
=======
files         : 3'919
directories   : 125
bytes         : 101'895'657'743
kilobytes     : 99'507'478.26
megabytes     : 97'175.27
gigabytes     : 94.90
PS E:\Down>
dua
PS E:\Down> dua --format bytes '.'
101895657743 b . entries
PS E:\Down> '{0:N0}' -f 101895657743
101'895'657'743
PS E:\Down>
du
PS E:\Down> du E:\Down
Files:        3919
Directories:  125
Size:         101'895'657'743 bytes
Size on disk: 101'904'261'120 bytes

PS E:\Down>

And last but not least, the total value in bytes as displayed in Windows Explorer: 94.8 GB (101'895'657'743 bytes)

OS information:

PS E:\Down> [Environment]::OSVersion.VersionString
Microsoft Windows NT 10.0.19043.0
PS E:\Down> Get-WindowsVersion | select Version, OS* | fl

Version  : 2009
OS Build : 19043.1526

PS E:\Down>

Hrxn avatar Feb 17 '22 01:02 Hrxn

Interesting! Thank you for reporting this.

> because there is nothing unusual in this path, no junctions, or hardlinks whatsoever.

Are you sure? Looks like dua-cli at least has some special handling for hard links. That could mean that diskus is double-counting one (or multiple) files. Maybe you can narrow it down by running the comparison on subfolders? Or try to search for a file with a size of 101,896,640,783 - 101,895,657,743 bytes.

sharkdp avatar Feb 17 '22 12:02 sharkdp

Yes, I'm sure, only files and directories.

But there's something else I noticed:

PS E:\> 101896640783 - 101895657743
983040
PS E:\> 983040/4096
240
PS E:\>

4096 bytes is the standard cluster size / default allocation unit size for NTFS (except for very large partitions).
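For anyone who wants to repeat this check on their own numbers, here is a minimal Python sketch of the same arithmetic. The 4096-byte cluster size is the NTFS default mentioned above and is simply assumed here, not read from the volume, and the helper name `cluster_gap` is made up for illustration.

```python
# Byte totals copied from the outputs above.
DISKUS_TOTAL = 101_896_640_783  # diskus
OTHER_TOTAL = 101_895_657_743   # duu, dua, Sysinternals du, Windows Explorer

# Assumed NTFS default allocation unit size (not queried from the volume).
CLUSTER_SIZE = 4096


def cluster_gap(larger: int, smaller: int, cluster: int = CLUSTER_SIZE) -> tuple[int, int]:
    """Return (whole clusters, leftover bytes) for the difference of two totals."""
    return divmod(larger - smaller, cluster)


clusters, remainder = cluster_gap(DISKUS_TOTAL, OTHER_TOTAL)
print(clusters, remainder)  # 240 0
```

A remainder of zero only shows that the gap is a whole number of clusters; it does not by itself say which entries those clusters belong to.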

I tested this with another directory (small, just 10 files), and I could observe the same phenomenon:

PS E:\> du Etc
Files:        10
Directories:  2
Size:         9'913 bytes
Size on disk: 86'016 bytes

PS E:\> diskus Etc
18.11 KB (18,105 bytes)
PS E:\> 18105-9913
8192
PS E:\> ls -force Etc

    Directory: E:\Etc

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----          31.12.2021    21:08                Menustore

PS E:\> du Etc\Menustore
Files:        10
Directories:  1
Size:         9'913 bytes
Size on disk: 81'920 bytes

PS E:\> diskus Etc\Menustore
14.01 KB (14,009 bytes)
PS E:\> 14009-9913
4096
PS E:\> 4096*2
8192

First test is the parent directory (Etc) , second test against the contained subdirectory (Menustore), which contains the 10 files.

So, in sum, as also printed with du, the former is 2 directories and 10 files, the latter is 1 directory and 10 files. In the first case, the reported difference by diskus is 8192 bytes, and the second difference is 4096 bytes. So, it seems to me that diskus is counting one cluster unit size per directory on top?

Hrxn avatar Feb 17 '22 17:02 Hrxn

Thank you for looking into this further. Your analysis is spot on! I looked at the source code of duu and dua-cli. As far as I can tell, both only count the size of all FILES (or non-directories):

  • https://github.com/jftuga/duu/blob/90d1ce8cbcfb7acb6b3c333a21434b4db0b15961/duu.py#L427-L434
  • https://github.com/Byron/dua-cli/blob/b029dc5d190b23bf3e3fc95a3947f28f868e674e/src/traverse.rs#L115

diskus, on the other hand, also adds the size of the directory entries themselves. Including the root directory:

  • https://github.com/sharkdp/diskus/blob/9641caa46755160f49e123c6d226aa46c9cdde34/src/walk.rs#L28
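To make the two counting strategies concrete, here is a rough Python sketch. It is not the actual code of duu, dua-cli or diskus; the function name `total_size` and its `include_directories` flag are made up for illustration, and it only looks at apparent sizes (`st_size`).

```python
import os


def total_size(root: str, include_directories: bool) -> int:
    """Sum st_size of everything below root, without following symlinks."""
    total = 0
    if include_directories:
        # Count the root directory entry itself as well.
        total += os.lstat(root).st_size
    for dirpath, dirnames, filenames in os.walk(root):
        # Regular files (and symlinks to files) are always counted.
        for name in filenames:
            total += os.lstat(os.path.join(dirpath, name)).st_size
        # Directory entries are counted only on request. Note that os.walk
        # lists symlinks to directories here too, so they are skipped in
        # files-only mode.
        if include_directories:
            for name in dirnames:
                total += os.lstat(os.path.join(dirpath, name)).st_size
    return total


if __name__ == "__main__":
    print("files only:      ", total_size(".", include_directories=False))
    print("with directories:", total_size(".", include_directories=True))
```

The size reported for a directory entry depends on the platform and filesystem, so the gap between the two numbers will differ from system to system.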

On Linux, this is consistent with what du does:

▶ mkdir test-directory
▶ touch test-directory/empty-file
▶ mkdir test-directory/empty-subdirectory
▶ echo -n "123" > test-directory/file-3-bytes
▶ echo -n "1234567" > test-directory/file-7-bytes

▶ du -s --block-size=1 test-directory; diskus test-directory          
8192	test-directory
8.19 KB (8,192 bytes)

▶ du -s --apparent-size --block-size=1 test-directory; diskus --apparent-size test-directory
170	test-directory
170 B (170 bytes)

Compare this with duu.py, which only sees the 3byte+7byte files:

▶ python duu.py --quiet test-directory

summary
=======
files         : 3
directories   : 2
bytes         : 10

So I am not really sure how to proceed here. Apparently, the "real" Windows tools also seem to disregard the size of directories?!

See also: https://github.com/sharkdp/diskus/pull/49

sharkdp avatar Feb 18 '22 21:02 sharkdp

Uh, well. I jotted down my reply too quickly. Had I stopped for a moment, I might have remembered that I tripped over this not that long ago.

> So I am not really sure how to proceed here. Apparently, the "real" Windows tools also seem to disregard the size of directories?!

Yeah, that's exactly the issue. On Windows, by definition, a directory itself does not have a size. Or, phrased differently, the size of the directory alone is always zero...

It's a question of abstraction. Without doing a deep dive on file systems here, the gist is that, on a technical level, a directory obviously does have a size. But, due to how NTFS works, we have another famous case of my favorite problem class in all of science (and engineering): counting, because counting is hard. Sesame Street didn't teach us that the real crux is what, where, and when to count...

In NTFS land, it's called the Master File Table (MFT), and this is where the space for the directories goes. NTFS has a reserved space for this, called the MFT Zone, I think, and while it's a part of the file system, it's not the, well, visible part of the file system.

To be honest, I don't know what the best way to proceed would be either. You could expand the Windows caveat section to mention that diskus follows Unix here and counts like du does, or add special-case handling for Windows. It's your project, so it's up to you 😄

Edit: I think this issue needs a more appropriate name. I'll give it a shot.

Hrxn avatar Feb 19 '22 01:02 Hrxn

> Yeah, that's exactly the issue. On Windows, by definition, a directory itself does not have a size. Or, phrased differently, the size of the directory alone is always zero...

I think it is important to distinguish between "disk usage" and "apparent size" here. GNU du typically shows "disk usage" (as the name implies), but can be switched to compute "apparent sizes". Both du and diskus compute these two quantities from st_size and st_blocks, two fields returned by the stat syscall:

struct stat {
   dev_t     st_dev;         /* ID of device containing file */
   ino_t     st_ino;         /* Inode number */
   mode_t    st_mode;        /* File type and mode */
   nlink_t   st_nlink;       /* Number of hard links */
   uid_t     st_uid;         /* User ID of owner */
   gid_t     st_gid;         /* Group ID of owner */
   dev_t     st_rdev;        /* Device ID (if special file) */
   off_t     st_size;        /* Total size, in bytes */
   blksize_t st_blksize;     /* Block size for filesystem I/O */
   blkcnt_t  st_blocks;      /* Number of 512B blocks allocated */

   /* Since Linux 2.6, the kernel supports nanosecond
      precision for the following timestamp fields.
      For the details before Linux 2.6, see NOTES. */

   struct timespec st_atim;  /* Time of last access */
   struct timespec st_mtim;  /* Time of last modification */
   struct timespec st_ctim;  /* Time of last status change */

#define st_atime st_atim.tv_sec      /* Backward compatibility */
#define st_mtime st_mtim.tv_sec
#define st_ctime st_ctim.tv_sec
};
       st_size
              This field gives the size of the file (if it is a regular
              file or a symbolic link) in bytes.  The size of a symbolic
              link is the length of the pathname it contains, without a
              terminating null byte.

       st_blocks
              This field indicates the number of blocks allocated to the
              file, in 512-byte units.  (This may be smaller than
              st_size/512 when the file has holes.)

where:

disk usage = 512 * st_blocks
apparent size = st_size
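As a small illustration of these two formulas, here is a Python sketch that reads both fields for a single path. The function name is made up for this example. Python's `os.stat_result` only exposes `st_blocks` on Unix-like systems, so the sketch returns `None` for disk usage where the field is missing (for example on Windows).

```python
import os
from typing import Optional, Tuple


def usage_and_apparent_size(path: str) -> Tuple[Optional[int], int]:
    """Return (disk usage, apparent size) in bytes for one path."""
    st = os.lstat(path)  # lstat: do not follow symlinks
    # st_blocks is counted in 512-byte units, regardless of the
    # filesystem's own block size. It is absent on some platforms.
    blocks = getattr(st, "st_blocks", None)
    disk_usage = 512 * blocks if blocks is not None else None
    return disk_usage, st.st_size


if __name__ == "__main__":
    print(usage_and_apparent_size("."))
```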

The following table shows the difference between these two quantities (on a filesystem with 4KiB block size):

Entry                                Disk usage   Apparent size
empty directory                      0 B          40 B
directory with empty subdirectory    0 B          100 B
file-0-bytes                         0 B          0 B
file-3-bytes                         4096 B       3 B
file-4096-bytes                      4096 B       4096 B
file-4097-bytes                      8192 B       4097 B
symlink to "file-3-bytes"            0 B          12 B
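The regular-file rows in the table follow a simple round-up rule, sketched below in Python. This is only an approximation for plain files on a filesystem with 4 KiB blocks: it ignores sparse files and filesystems that store tiny files inline, and it says nothing about the directory and symlink rows. The function name is made up for illustration.

```python
def expected_disk_usage(size: int, block_size: int = 4096) -> int:
    """Round a regular file's size up to a whole number of blocks."""
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size


for size in (0, 3, 4096, 4097):
    print(size, "->", expected_disk_usage(size))
```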

It looks to me like the number you are getting on Windows is neither disk usage, nor apparent size. It's the "apparent size of all files, excluding directories".

The Windows sysinternals du tool seems to report both "Size" and "Size on disk" (similar to what Windows Explorer does, I guess?). For some reason though, the "size on disk" number is even higher than what I would expect from a "disk usage" option. In my integration test, the output on the test directory which contains a 0B, a 3B and a 7B file as well as an empty directory, is the following:

diskus:

4106

(which is 4096+10)

sysinternals du:

Size:         10 bytes
Size on disk: 24,576 bytes

I have no idea where the latter number of "24,576 = 6 × 4,096" bytes comes from.

But I agree with you that something should be changed on Windows. Maybe we could introduce a --include-directory-size=never/always/auto option where auto would be the default and select always on Linux and never on Windows?
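To illustrate how the `auto` default could resolve, here is a sketch in Python (used purely for illustration, not the language diskus is written in). The option does not exist yet, the function name is made up, and treating every non-Windows platform like Linux is an assumption that goes beyond what is proposed above.

```python
import sys


def include_directory_size(mode: str, platform: str = sys.platform) -> bool:
    """Resolve a hypothetical --include-directory-size value to a boolean."""
    if mode == "always":
        return True
    if mode == "never":
        return False
    if mode == "auto":
        # Assumption: only Windows ("win32" in sys.platform terms) skips
        # directory sizes; every other platform behaves like Linux.
        return not platform.startswith("win")
    raise ValueError(f"unknown mode: {mode!r}")


print(include_directory_size("auto"))
```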

sharkdp avatar Feb 19 '22 12:02 sharkdp

> The Windows sysinternals du tool seems to report both "Size" and "Size on disk" (similar to what Windows Explorer does, I guess?).

Yes, this is true, du.exe always shows the same values as Windows does in the File Explorer, for example. Both "Size" and "Size on Disk".

> For some reason though, the "size on disk" number is even higher than what I would expect from a "disk usage" option. In my integration test, the output on the test directory which contains a 0B, a 3B and a 7B file as well as an empty directory, is the following:
>
> diskus:
>
> 4106
>
> (which is 4096+10)
>
> sysinternals du:
>
> Size:         10 bytes
> Size on disk: 24,576 bytes
>
> I have no idea where the latter number of "24,576 = 6 × 4,096" bytes comes from.

Good question...

PS D:\Temp> ls -fo
PS D:\Temp> mkdir 'emptydir'

    Directory: D:\Temp

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----          19.02.2022    15:09                emptydir

PS D:\Temp> 'a' > 3by; 'aaaaa' > 7by
PS D:\Temp> new-item 0by

    Directory: D:\Temp

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          19.02.2022    15:10              0 0by

PS D:\Temp> ls

    Directory: D:\Temp

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----          19.02.2022    15:09                emptydir
-a---          19.02.2022    15:10              0 0by
-a---          19.02.2022    15:10              3 3by
-a---          19.02.2022    15:10              7 7by

PS D:\Temp> cd ..
PS D:\> du Temp
Files:        3
Directories:  2
Size:         10 bytes
Size on disk: 24'576 bytes

PS D:\>

I thought the parent directory itself might account for it, but that doesn't really add up. An empty directory inside an otherwise empty directory gives the expected result, and the parent directory does not count:

Files:        0
Directories:  2
Size:         0 bytes
Size on disk: 4'096 bytes

> But I agree with you that something should be changed on Windows. Maybe we could introduce a --include-directory-size=never/always/auto option where auto would be the default and select always on Linux and never on Windows?

Sounds good to me!

Hrxn avatar Feb 19 '22 14:02 Hrxn

Windows 7 x64, Diskus 0.7.0

Let's get the folder size in bytes.

$ for /f "tokens=1,2 delims=: " %a in ('robocopy C:\Test . /L /BYTES /S /NJH /NDL /NFL /XJ /R:0 /W:0') do @if /i %a==Bytes echo %b
3012235061

$ pwsh -c "(gci -lp C:\Test -r -force | measure -p length -sum).sum"
3012235061

$ coreutils du -bs C:\Test
3012235061      C:\Test

$ duu -q C:\Test
summary
=======
files         : 60 531
directories   : 7 136
bytes         : 3 012 235 061
kilobytes     : 2 941 635,80
megabytes     : 2 872,69
gigabytes     : 2,81

So far so good, the output matches. But what does Diskus show?

$ diskus.exe --size-format decimal C:\Test
3.04 GB (3,036,213,045 bytes)

$ diskus.exe --size-format binary C:\Test
2.83 GiB (3,036,213,045 bytes)

3 012 235 061 vs 3 036 213 045.

Something is wrong, indeed.

sergeevabc avatar Feb 18 '24 20:02 sergeevabc

Err… Hello?

sergeevabc avatar Mar 11 '24 17:03 sergeevabc

Yes? Did you read the entire issue here? It should explain everything.

Hrxn avatar Mar 11 '24 19:03 Hrxn