fish-shell
`**` glob witnessing dataless directories materializes them on macOS
Files that exist "dataless" on disk are being forced to download and sync locally just by iterating over them with **.
This affects things like iCloud paths, likely Dropbox, Google Drive, Box accounts, etc. (assuming they now use this), and definitely S3/DAV/FTP/whatever shares through Strongsync, which uses the FileProvider API.
"dataless" is a state supported by APFS in which a file or directory is a placeholder, and its children (for a directory) or content (for a file) will be fetched when the user tries to read it.
ls -l% is an easy way to see if files are dataless or not.
-% Distinguish dataless files and directories with a '%' character in long (-l) output, and
don't materialize dataless directories when listing them.
I first noticed this when trying to determine why ** in my home directory could take minutes. After attempting to debug it for a while I noticed all of a sudden it was quite fast. ~~But my hard disk was almost empty.~~
~/L/C/Strongsync-SPMDrive►echo *
bugreceipt.png Inspection Entry.webloc Inspection Record.webloc log_0D1CD0_1-21.webloc OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc Shared Drives Shared with Me sonicwall-TZ_400-6_2_3_1-19n.exp SPM techSupport_0D1CD0_1-21.wri test.pdf textual.mov Untitled spreadsheet.webloc
~/L/C/Strongsync-SPMDrive►ls -l%
total 0
-rw-------% 1 floam staff 317 May 6 2015 Inspection Entry.webloc
-rw-------% 1 floam staff 324 Oct 29 2014 Inspection Record.webloc
-rw-------% 1 floam staff 324 Oct 6 2016 OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc
drwx------% 20 floam staff 640 Sep 29 2015 SPM/
drwx------% 2 floam staff 64 Nov 2 00:13 Shared Drives/
drwx------% 10 floam staff 320 Nov 2 00:13 Shared with Me/
-rw-------% 1 floam staff 324 Oct 29 2014 Untitled spreadsheet.webloc
-rw-------% 1 floam staff 78132 Apr 23 2016 bugreceipt.png
-rw-------% 1 floam staff 324 Jan 21 2016 log_0D1CD0_1-21.webloc
-rw-------% 1 floam staff 808062 Jan 21 2016 sonicwall-TZ_400-6_2_3_1-19n.exp
-rw-------% 1 floam staff 738966 Jan 21 2016 techSupport_0D1CD0_1-21.wri
-rw-------% 1 floam staff 492607 Apr 1 2016 test.pdf
-rw-------% 1 floam staff 8467525 Apr 18 2016 textual.mov
~/L/C/Strongsync-SPMDrive►echo **
bugreceipt.png Inspection Entry.webloc Inspection Record.webloc log_0D1CD0_1-21.webloc OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc Shared Drives Shared with Me Shared with Me/BOTTOM.step Shared with Me/image0.jpeg Shared with Me/image1.jpeg Shared with Me/image2.jpeg Shared with Me/image3.jpeg Shared with Me/image5.png Shared with Me/sw_tz-400_eng_6.2.5.0_6.2.5_15n_858612.sig Shared with Me/TOP.step sonicwall-TZ_400-6_2_3_1-19n.exp SPM SPM/IMG_0393.JPG SPM/IMG_0394.JPG SPM/IMG_0395.JPG SPM/IMG_0396.JPG SPM/IMG_0397.JPG SPM/IMG_0398.JPG SPM/IMG_0399.JPG SPM/IMG_0400.JPG SPM/IMG_0401.JPG SPM/IMG_0402.JPG SPM/IMG_0403.JPG SPM/IMG_0404.JPG SPM/IMG_0405.JPG SPM/IMG_0406.JPG SPM/IMG_0407.JPG SPM/IMG_0408.JPG SPM/IMG_0412.JPG SPM/IMG_0413.JPG techSupport_0D1CD0_1-21.wri test.pdf textual.mov Untitled spreadsheet.webloc
~/L/C/Strongsync-SPMDrive►echo $CMD_DURATION
9486
~/L/C/Strongsync-SPMDrive►ls -l%
total 0
-rw-------% 1 floam staff 317 May 6 2015 Inspection Entry.webloc
-rw-------% 1 floam staff 324 Oct 29 2014 Inspection Record.webloc
-rw-------% 1 floam staff 324 Oct 6 2016 OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc
drwx------@ 20 floam staff 640 Sep 29 2015 SPM/
drwx------@ 2 floam staff 64 Nov 2 00:13 Shared Drives/
drwx------@ 10 floam staff 320 Nov 2 00:13 Shared with Me/
-rw-------% 1 floam staff 324 Oct 29 2014 Untitled spreadsheet.webloc
-rw-------% 1 floam staff 78132 Apr 23 2016 bugreceipt.png
-rw-------% 1 floam staff 324 Jan 21 2016 log_0D1CD0_1-21.webloc
-rw-------% 1 floam staff 808062 Jan 21 2016 sonicwall-TZ_400-6_2_3_1-19n.exp
-rw-------% 1 floam staff 738966 Jan 21 2016 techSupport_0D1CD0_1-21.wri
-rw-------% 1 floam staff 492607 Apr 1 2016 test.pdf
-rw-------% 1 floam staff 8467525 Apr 18 2016 textual.mov
Notice the three directories whose marker changed from % to @: the glob has apparently forced the remote file servers to be queried so their contents could be listed.
Neither ls (except -l sans -%, see quoted manual above), du, nor echo ** on zsh has this behavior. We should figure out how to avoid it.
FYI useful in debugging this is fileproviderctl evict -n <FILES> which will return the lazily materialized files/directories to a dataless state if they come from something using FileProvider.
from du.c:
// "du" should not have any side effect on disk usage,
// so prevent materializing dataless directories upon traversal
rval = 1;
(void) sysctlbyname("vfs.nspace.prevent_materialization", NULL, NULL, &rval, sizeof(rval));
I see in the xnu sources there is also thread_prevent_materialization.
You can also search for "dataless" in file_cmds/ls/ls.c.
FYI, this is something added in macOS 10.15+.
Hm, it looks like zsh needed no special measures to not do this.
I think I'll need to spy on bash or zsh while doing traversal on ** because I'm kind of stumped.
What I'd look into: one big difference between how bash/zsh and fish handle recursive globs is that fish will follow symlinks and remember which files it has already visited to avoid loops, while bash/zsh just ignore links.
So if this looked like a link or resolving the link caused a materialization, that would be triggered by fish but not zsh.
Or it's possible that e.g. stat() causes a materialization but access() doesn't or something similar.
Or fish has the materialization entitlement but zsh doesn't? Should it have it? If not, can we just revoke it?
Can definitely stat without materializing. (I think stat is necessary to determine if a path has this property, it's the '4' in the most significant digit of st_flags according to stat.h.)
So it's going to come down to how access() was used (or not), or whether and how it ever did a readdir(), or something like that.
floam@M1 ~/L/C/Strongsync-SPMDrive> sh -c 'echo **'
Inspection Entry.webloc Inspection Record.webloc OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc SPM Shared Drives Shared with Me Untitled spreadsheet.webloc bugreceipt.png log_0D1CD0_1-21.webloc sonicwall-TZ_400-6_2_3_1-19n.exp techSupport_0D1CD0_1-21.wri test.pdf textual.mov
floam@M1 ~/L/C/Strongsync-SPMDrive> echo $CMD_DURATION
13
floam@M1 ~/L/C/Strongsync-SPMDrive> ls -lO%d SPM
drwx------% 21 floam staff compressed,dataless 672 Sep 29 2015 SPM/
floam@M1 ~/L/C/Strongsync-SPMDrive> stat -s SPM
st_dev=16777231 st_ino=75398548 st_mode=040700 st_nlink=21 st_uid=501 st_gid=20 st_rdev=0 st_size=672 st_atime=1636012301 st_mtime=1443567058 st_ctime=1636012301 st_birthtime=1443567058 st_blksize=4096 st_blocks=0 st_flags=1073741856
floam@M1 ~/L/C/Strongsync-SPMDrive> printf %x 1073741856
40000020
floam@M1 ~/L/C/Strongsync-SPMDrive> stat -x SPM
File: "SPM"
Size: 672 FileType: Directory
Mode: (0700/drwx------) Uid: ( 501/ floam) Gid: ( 20/ staff)
Device: 1,15 Inode: 75398548 Links: 21
Access: Thu Nov 4 00:50:07 2021
Modify: Tue Sep 29 15:50:58 2015
Change: Thu Nov 4 00:50:07 2021
floam@M1 ~/L/C/Strongsync-SPMDrive> ls -lO%d SPM
drwx------% 21 floam staff compressed,dataless 672 Sep 29 2015 SPM/
floam@M1 ~/L/C/Strongsync-SPMDrive> echo **
bugreceipt.png Inspection Entry.webloc Inspection Record.webloc log_0D1CD0_1-21.webloc OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc Shared Drives Shared with Me Shared with Me/BOTTOM.step Shared with Me/image0.jpeg Shared with Me/image1.jpeg Shared with Me/image2.jpeg Shared with Me/image3.jpeg Shared with Me/image5.png Shared with Me/sw_tz-400_eng_6.2.5.0_6.2.5_15n_858612.sig Shared with Me/TOP.step sonicwall-TZ_400-6_2_3_1-19n.exp SPM SPM/foo SPM/IMG_0393.JPG SPM/IMG_0394.JPG SPM/IMG_0395.JPG SPM/IMG_0396.JPG SPM/IMG_0397.JPG SPM/IMG_0398.JPG SPM/IMG_0399.JPG SPM/IMG_0400.JPG SPM/IMG_0401.JPG SPM/IMG_0402.JPG SPM/IMG_0403.JPG SPM/IMG_0404.JPG SPM/IMG_0405.JPG SPM/IMG_0406.JPG SPM/IMG_0407.JPG SPM/IMG_0408.JPG SPM/IMG_0412.JPG SPM/IMG_0413.JPG techSupport_0D1CD0_1-21.wri test.pdf textual.mov Untitled spreadsheet.webloc
floam@M1 ~/L/C/Strongsync-SPMDrive> echo $CMD_DURATION
3438
floam@M1 ~/L/C/Strongsync-SPMDrive> ls -lO%d SPM
drwx------@ 21 floam staff - 672 Sep 29 2015 SPM/
floam@M1 ~/L/C/Strongsync-SPMDrive> stat -x SPM
File: "SPM"
Size: 672 FileType: Directory
Mode: (0700/drwx------) Uid: ( 501/ floam) Gid: ( 20/ staff)
Device: 1,15 Inode: 75398548 Links: 21
Access: Thu Nov 4 00:50:07 2021
Modify: Tue Sep 29 15:50:58 2015
Change: Thu Nov 4 00:51:06 2021
floam@M1 ~/L/C/Strongsync-SPMDrive> stat -s SPM
st_dev=16777231 st_ino=75398548 st_mode=040700 st_nlink=21 st_uid=501 st_gid=20 st_rdev=0 st_size=672 st_atime=1636012301 st_mtime=1443567058 st_ctime=1636012603 st_birthtime=1443567058 st_blksize=4096 st_blocks=0 st_flags=0
(And no, that's not one of the entitlements fish has. And the binary I run has zero entitlements.)
access() won't materialize (and F_OK, R_OK are affirmative), opendir won't materialize, readdir() loop will materialize.
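#include <stdio.h>
#include <sys/stat.h>

int main() {
    struct stat foo;
    // stat() the dataless directory itself and report its file type
    if (stat("SPM", &foo) != 0) {
        perror("stat");
        return 1;
    }
    if (S_ISLNK(foo.st_mode))
        printf("S_ISLNK\n");
    if (S_ISDIR(foo.st_mode))
        printf("S_ISDIR\n");
    return 0;
}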
Prints just "S_ISDIR" and does not materialize the dataless directory.
readdir() loop will materialize
That poses three questions:
- Is there a way we can get the filenames without materialization? Maybe a way to skip reading into dataless directories?
- If there isn't, is there a nice way to read through this without materialization?
- Can we do it without performance regressions elsewhere? I'd rather #ifdef __APPLE__ this than make other systems worse.
The answer to number 1 is definitely yes. See the examples I posted above: as long as you don't read through the dataless directory itself (in my examples, "SPM"), it won't materialize. You can opendir and readdir the parent directory, of course. You can ls the dataless directory, you can stat it, you can even opendir it. Just do not try to traverse it with readdir, it seems? Or stat an actual child of it.
We can identify a directory like this by looking at st_flags. I think st_flags & SF_DATALESS.
What is confusing to me is how bash and zsh are seemingly not needing to special case this.
Oh, regarding number 1: we cannot get the filenames of the unmaterialized files inside the dataless directory. You'll notice that while zsh and bash can echo **, they do not show the contents of these directories, at least until you actually chdir into one or specify a path there.
We can identify a directory like this by looking at st_flags. I think st_flags & SF_DATALESS.
So something like adding
#ifdef SF_DATALESS
if (buf.st_flags & SF_DATALESS) continue;
#endif
after https://github.com/fish-shell/fish-shell/blob/6a7ba7921a04f7bdf02afaeae85ae36dbd9d0d4e/src/wildcard.cpp#L772-L777?
Yeah - but I would like to better understand what's going on first.
Nice find! Can files be dataless or only directories? (On Windows, only files can be "dataless")
EDIT: going through the sources, both may be dataless. So the question is if we're currently only materializing the directories or both?
Only directories.
(which doesn't explain my empty hard disk: perhaps a coincidence.)
But it certifiably makes globs… really slow.
Okay, I'm going to close this one as "not planned" since it appears to not be happening.
It looks like it's clear what would have to be done, but someone with a macOS system needs to try it out and report back.
For anyone who comes across this old issue, this link might be useful:
https://developer.apple.com/documentation/technotes/tn3150-getting-ready-for-data-less-files#Understand-the-impact-of-file-materialization