fish-shell

`**` glob witnessing dataless directories materializes them on macOS

Open floam opened this issue 4 years ago • 21 comments

Files that exist "dataless" on disk are being forced to download and sync locally just by iterating over them with **.

This affects stuff like iCloud paths, likely Dropbox, Google Drive, Box accounts etc. (assuming they now use this), and definitely S3/DAV/FTP/whatever shares through Strongsync, which uses the FileProvider API.

"dataless" is a state supported by APFS in which a file or directory is a placeholder, and its children (for a directory) or content (for a file) will be fetched when the user tries to read it.

ls -l% is an easy way to see if files are dataless or not.

  -%      Distinguish dataless files and directories with a '%' character in long (-l) output, and
          don't materialize dataless directories when listing them.

I first noticed this when trying to determine why ** in my home directory could take minutes. After attempting to debug it for a while I noticed all of a sudden it was quite fast. ~~But my hard disk was almost empty.~~

~/L/C/Strongsync-SPMDrive►echo *
bugreceipt.png Inspection Entry.webloc Inspection Record.webloc log_0D1CD0_1-21.webloc OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc Shared Drives Shared with Me sonicwall-TZ_400-6_2_3_1-19n.exp SPM techSupport_0D1CD0_1-21.wri test.pdf textual.mov Untitled spreadsheet.webloc
~/L/C/Strongsync-SPMDrive►ls -l%
total 0
-rw-------%  1 floam  staff      317 May  6  2015 Inspection Entry.webloc
-rw-------%  1 floam  staff      324 Oct 29  2014 Inspection Record.webloc
-rw-------%  1 floam  staff      324 Oct  6  2016 OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc
drwx------% 20 floam  staff      640 Sep 29  2015 SPM/
drwx------%  2 floam  staff       64 Nov  2 00:13 Shared Drives/
drwx------% 10 floam  staff      320 Nov  2 00:13 Shared with Me/
-rw-------%  1 floam  staff      324 Oct 29  2014 Untitled spreadsheet.webloc
-rw-------%  1 floam  staff    78132 Apr 23  2016 bugreceipt.png
-rw-------%  1 floam  staff      324 Jan 21  2016 log_0D1CD0_1-21.webloc
-rw-------%  1 floam  staff   808062 Jan 21  2016 sonicwall-TZ_400-6_2_3_1-19n.exp
-rw-------%  1 floam  staff   738966 Jan 21  2016 techSupport_0D1CD0_1-21.wri
-rw-------%  1 floam  staff   492607 Apr  1  2016 test.pdf
-rw-------%  1 floam  staff  8467525 Apr 18  2016 textual.mov
~/L/C/Strongsync-SPMDrive►echo ** 
bugreceipt.png Inspection Entry.webloc Inspection Record.webloc log_0D1CD0_1-21.webloc OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc Shared Drives Shared with Me Shared with Me/BOTTOM.step Shared with Me/image0.jpeg Shared with Me/image1.jpeg Shared with Me/image2.jpeg Shared with Me/image3.jpeg Shared with Me/image5.png Shared with Me/sw_tz-400_eng_6.2.5.0_6.2.5_15n_858612.sig Shared with Me/TOP.step sonicwall-TZ_400-6_2_3_1-19n.exp SPM SPM/IMG_0393.JPG SPM/IMG_0394.JPG SPM/IMG_0395.JPG SPM/IMG_0396.JPG SPM/IMG_0397.JPG SPM/IMG_0398.JPG SPM/IMG_0399.JPG SPM/IMG_0400.JPG SPM/IMG_0401.JPG SPM/IMG_0402.JPG SPM/IMG_0403.JPG SPM/IMG_0404.JPG SPM/IMG_0405.JPG SPM/IMG_0406.JPG SPM/IMG_0407.JPG SPM/IMG_0408.JPG SPM/IMG_0412.JPG SPM/IMG_0413.JPG techSupport_0D1CD0_1-21.wri test.pdf textual.mov Untitled spreadsheet.webloc
~/L/C/Strongsync-SPMDrive►echo $CMD_DURATION
9486
~/L/C/Strongsync-SPMDrive►ls -l%
total 0
-rw-------%  1 floam  staff      317 May  6  2015 Inspection Entry.webloc
-rw-------%  1 floam  staff      324 Oct 29  2014 Inspection Record.webloc
-rw-------%  1 floam  staff      324 Oct  6  2016 OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc
drwx------@ 20 floam  staff      640 Sep 29  2015 SPM/
drwx------@  2 floam  staff       64 Nov  2 00:13 Shared Drives/
drwx------@ 10 floam  staff      320 Nov  2 00:13 Shared with Me/
-rw-------%  1 floam  staff      324 Oct 29  2014 Untitled spreadsheet.webloc
-rw-------%  1 floam  staff    78132 Apr 23  2016 bugreceipt.png
-rw-------%  1 floam  staff      324 Jan 21  2016 log_0D1CD0_1-21.webloc
-rw-------%  1 floam  staff   808062 Jan 21  2016 sonicwall-TZ_400-6_2_3_1-19n.exp
-rw-------%  1 floam  staff   738966 Jan 21  2016 techSupport_0D1CD0_1-21.wri
-rw-------%  1 floam  staff   492607 Apr  1  2016 test.pdf
-rw-------%  1 floam  staff  8467525 Apr 18  2016 textual.mov

Notice that the three directories went from % (dataless) to @ in the second listing: they were materialized, which apparently forced the remote file servers to be queried, etc.

Neither ls (except -l sans -%, see quoted manual above), du, nor echo ** on zsh has this behavior. We should figure out how to avoid it.

floam avatar Nov 02 '21 07:11 floam

FYI useful in debugging this is fileproviderctl evict -n <FILES> which will return the lazily materialized files/directories to a dataless state if they come from something using FileProvider.

floam avatar Nov 02 '21 07:11 floam

from du.c:

	// "du" should not have any side effect on disk usage,
	// so prevent materializing dataless directories upon traversal
	rval = 1;
	(void) sysctlbyname("vfs.nspace.prevent_materialization", NULL, NULL, &rval, sizeof(rval));

I see in xnu sources there is also thread_prevent_materialization.

floam avatar Nov 02 '21 07:11 floam

can also search for "dataless" in file_cmds/ls/ls.c.

floam avatar Nov 02 '21 07:11 floam

FYI, this is something added in macOS 10.15+.

floam avatar Nov 02 '21 07:11 floam

Hm, it looks like zsh needed no special measures to not do this.

floam avatar Nov 04 '21 02:11 floam

I think I'll need to spy on bash or zsh while doing traversal on ** because I'm kind of stumped.

floam avatar Nov 04 '21 03:11 floam

What I'd look into: One big difference between how bash/zsh and fish handle recursive globs is that fish will follow symlinks and remember which files it already viewed to avoid loops, while bash/zsh just ignore links.

So if this looked like a link or resolving the link caused a materialization, that would be triggered by fish but not zsh.

Or it's possible that e.g. stat() causes a materialization but access() doesn't or something similar.

Or fish has the materialization entitlement but zsh doesn't? Should it have it? If not, can we just revoke it?

faho avatar Nov 04 '21 07:11 faho

Can definitely stat without materializing. (I think stat is necessary to determine if a path has this property, it's the '4' in the most significant digit of st_flags according to stat.h.)

So yeah, it's going to come down to whether or how access() was used, or whether or how it ever did a readdir(), or something like that.

floam@M1 ~/L/C/Strongsync-SPMDrive> sh -c 'echo **'
Inspection Entry.webloc Inspection Record.webloc OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc SPM Shared Drives Shared with Me Untitled spreadsheet.webloc bugreceipt.png log_0D1CD0_1-21.webloc sonicwall-TZ_400-6_2_3_1-19n.exp techSupport_0D1CD0_1-21.wri test.pdf textual.mov
floam@M1 ~/L/C/Strongsync-SPMDrive> echo $CMD_DURATION
13
floam@M1 ~/L/C/Strongsync-SPMDrive> ls -lO%d SPM
drwx------% 21 floam  staff  compressed,dataless 672 Sep 29  2015 SPM/
floam@M1 ~/L/C/Strongsync-SPMDrive> stat -s SPM
st_dev=16777231 st_ino=75398548 st_mode=040700 st_nlink=21 st_uid=501 st_gid=20 st_rdev=0 st_size=672 st_atime=1636012301 st_mtime=1443567058 st_ctime=1636012301 st_birthtime=1443567058 st_blksize=4096 st_blocks=0 st_flags=1073741856
floam@M1 ~/L/C/Strongsync-SPMDrive> printf %x 1073741856
40000020
floam@M1 ~/L/C/Strongsync-SPMDrive> stat -x SPM
  File: "SPM"
  Size: 672          FileType: Directory
  Mode: (0700/drwx------)         Uid: (  501/   floam)  Gid: (   20/   staff)
Device: 1,15   Inode: 75398548    Links: 21
Access: Thu Nov  4 00:50:07 2021
Modify: Tue Sep 29 15:50:58 2015
Change: Thu Nov  4 00:50:07 2021
floam@M1 ~/L/C/Strongsync-SPMDrive> ls -lO%d SPM
drwx------% 21 floam  staff  compressed,dataless 672 Sep 29  2015 SPM/
floam@M1 ~/L/C/Strongsync-SPMDrive> echo **
bugreceipt.png Inspection Entry.webloc Inspection Record.webloc log_0D1CD0_1-21.webloc OrgData-smallpartsmfg.com-1p2uljz3ras55u-20160720.csv.webloc Shared Drives Shared with Me Shared with Me/BOTTOM.step Shared with Me/image0.jpeg Shared with Me/image1.jpeg Shared with Me/image2.jpeg Shared with Me/image3.jpeg Shared with Me/image5.png Shared with Me/sw_tz-400_eng_6.2.5.0_6.2.5_15n_858612.sig Shared with Me/TOP.step sonicwall-TZ_400-6_2_3_1-19n.exp SPM SPM/foo SPM/IMG_0393.JPG SPM/IMG_0394.JPG SPM/IMG_0395.JPG SPM/IMG_0396.JPG SPM/IMG_0397.JPG SPM/IMG_0398.JPG SPM/IMG_0399.JPG SPM/IMG_0400.JPG SPM/IMG_0401.JPG SPM/IMG_0402.JPG SPM/IMG_0403.JPG SPM/IMG_0404.JPG SPM/IMG_0405.JPG SPM/IMG_0406.JPG SPM/IMG_0407.JPG SPM/IMG_0408.JPG SPM/IMG_0412.JPG SPM/IMG_0413.JPG techSupport_0D1CD0_1-21.wri test.pdf textual.mov Untitled spreadsheet.webloc
floam@M1 ~/L/C/Strongsync-SPMDrive> echo $CMD_DURATION
3438
floam@M1 ~/L/C/Strongsync-SPMDrive> ls -lO%d SPM
drwx------@ 21 floam  staff  - 672 Sep 29  2015 SPM/
floam@M1 ~/L/C/Strongsync-SPMDrive> stat -x SPM
  File: "SPM"
  Size: 672          FileType: Directory
  Mode: (0700/drwx------)         Uid: (  501/   floam)  Gid: (   20/   staff)
Device: 1,15   Inode: 75398548    Links: 21
Access: Thu Nov  4 00:50:07 2021
Modify: Tue Sep 29 15:50:58 2015
Change: Thu Nov  4 00:51:06 2021
floam@M1 ~/L/C/Strongsync-SPMDrive> stat -s SPM
st_dev=16777231 st_ino=75398548 st_mode=040700 st_nlink=21 st_uid=501 st_gid=20 st_rdev=0 st_size=672 st_atime=1636012301 st_mtime=1443567058 st_ctime=1636012603 st_birthtime=1443567058 st_blksize=4096 st_blocks=0 st_flags=0
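
The flag check discussed above can be sketched roughly like this. It is only a minimal sketch, not anything from fish: the helper names `flags_are_dataless` and `path_is_dataless` are made up for illustration, and the fallback `#define` assumes the `0x40000000` value that the `printf %x` output above points at, so that the snippet also compiles on systems without `SF_DATALESS`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/* macOS <sys/stat.h> provides SF_DATALESS; this fallback (an assumption based
 * on the st_flags value shown in the transcript) lets the sketch build elsewhere. */
#ifndef SF_DATALESS
#define SF_DATALESS 0x40000000
#endif

/* Hypothetical helper: does a raw st_flags value carry the dataless bit? */
int flags_are_dataless(uint32_t flags) {
    return (flags & SF_DATALESS) != 0;
}

/* Hypothetical helper: 1 if path is dataless, 0 if not, -1 if stat() fails.
 * On platforms whose struct stat has no st_flags it always reports 0. */
int path_is_dataless(const char *path) {
    assert(path != NULL);
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
#ifdef __APPLE__
    return flags_are_dataless(st.st_flags);
#else
    return 0;
#endif
}
```

With the values from the transcript, `flags_are_dataless(1073741856)` is true for the dataless SPM directory and `flags_are_dataless(0)` is false once it has been materialized.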

floam avatar Nov 04 '21 07:11 floam

(And no, that's not one of the entitlements fish has. And the binary I run has zero entitlements.)

floam avatar Nov 04 '21 08:11 floam

access() won't materialize (and F_OK, R_OK are affirmative), opendir won't materialize, readdir() loop will materialize.
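
A rough way to reproduce that kind of observation is to re-check the dataless flag after each call. This is a sketch under assumptions, not the program I actually ran: `dataless_now` and `probe_directory` are hypothetical names, the fallback `SF_DATALESS` value is assumed from the `st_flags` output earlier in the thread, and on non-Apple systems the flag check is simply stubbed out to 0.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef SF_DATALESS
#define SF_DATALESS 0x40000000 /* assumed value, see st_flags output above */
#endif

/* 1 if path is currently dataless, 0 if not (or if the platform has no
 * st_flags), -1 if stat() fails. */
int dataless_now(const char *path) {
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
#ifdef __APPLE__
    return (st.st_flags & SF_DATALESS) != 0;
#else
    return 0;
#endif
}

/* Calls access(), opendir() and a full readdir() pass on dir, printing the
 * dataless state after each step. Returns the final state. */
int probe_directory(const char *dir) {
    assert(dir != NULL);
    printf("start:   dataless=%d\n", dataless_now(dir));

    (void)access(dir, R_OK);
    printf("access:  dataless=%d\n", dataless_now(dir));

    DIR *d = opendir(dir);
    printf("opendir: dataless=%d\n", dataless_now(dir));

    if (d != NULL) {
        while (readdir(d) != NULL) {
            /* just drain the directory stream */
        }
        closedir(d);
    }
    int final_state = dataless_now(dir);
    printf("readdir: dataless=%d\n", final_state);
    return final_state;
}
```

Running `probe_directory("SPM")` on a dataless directory and then resetting it with `fileproviderctl evict -n SPM` should make it easy to repeat the experiment.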

floam avatar Nov 04 '21 08:11 floam

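#include <stdio.h>
#include <sys/stat.h>

int main() {
	struct stat foo;

	// stat() on the dataless directory itself; bail out if it fails.
	if (stat("SPM", &foo) != 0) {
		perror("stat");
		return 1;
	}
	// stat() follows symlinks, so S_ISLNK is never true here (lstat() would be needed).
	if (S_ISLNK(foo.st_mode))
		printf("S_ISLNK\n");
	if (S_ISDIR(foo.st_mode))
		printf("S_ISDIR\n");

	return 0;
}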

Prints just "S_ISDIR" and does not materialize the dataless directory.

floam avatar Nov 04 '21 08:11 floam

> readdir() loop will materialize

That poses three questions:

  1. Is there a way we get the filename without materialization? Maybe a way to skip reading into dataless directories?
  2. If there isn't, is there a nice way to read through this without materialization?
  3. Can we do it without performance regressions elsewhere? I'd rather #ifdef APPLE this than make other systems worse.

faho avatar Nov 04 '21 08:11 faho

The answer to number 1 is definitely yes. See the examples I posted above: as long as you don't traverse the dataless directory itself (in my examples, "SPM"), it won't materialize. You can opendir the parent directory, of course. You can ls the dataless directory, you can stat it, you can even opendir it. Just do not try to traverse it with readdir, it seems, or stat an actual child of it.

We can identify a directory like this by looking at st_flags. I think st_flags & SF_DATALESS.

What is confusing to me is how bash and zsh are seemingly not needing to special case this.

floam avatar Nov 04 '21 08:11 floam

Oh, regarding number 1: we cannot get the filenames of the unmaterialized files inside the dataless directory. You'll notice that while zsh and bash can echo **, they do not show the contents of these directories until you actually chdir into one or specify a path there.

floam avatar Nov 04 '21 08:11 floam

> We can identify a directory like this by looking at st_flags. I think st_flags & SF_DATALESS.

So something like adding

       #ifdef SF_DATALESS
       if (buf.st_flags & SF_DATALESS) continue;
       #endif

after https://github.com/fish-shell/fish-shell/blob/6a7ba7921a04f7bdf02afaeae85ae36dbd9d0d4e/src/wildcard.cpp#L772-L777?
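
To make the idea concrete outside of wildcard.cpp, here is a standalone sketch of a recursive walk that lists a dataless directory as an entry but never descends into it. It is not fish's actual traversal: `is_dataless` and `walk_skipping_dataless` are hypothetical names, the fallback `SF_DATALESS` value is an assumption taken from the `st_flags` output above, and unlike fish it uses lstat() and so does not follow symlinks.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#ifndef SF_DATALESS
#define SF_DATALESS 0x40000000 /* assumed value, see st_flags output above */
#endif

/* 1 if st describes a dataless object; always 0 where struct stat has no st_flags. */
int is_dataless(const struct stat *st) {
#ifdef __APPLE__
    return (st->st_flags & SF_DATALESS) != 0;
#else
    (void)st;
    return 0;
#endif
}

/* Counts every entry below dir, but only recurses into directories that are
 * not dataless. Returns 0 on success, -1 if dir itself cannot be opened.
 * The starting directory is not checked: naming it explicitly is treated as
 * a request to read it. */
int walk_skipping_dataless(const char *dir, long *count) {
    assert(dir != NULL && count != NULL);
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;

        char path[4096];
        int n = snprintf(path, sizeof path, "%s/%s", dir, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue; /* path too long for this sketch, skip it */

        struct stat st;
        if (lstat(path, &st) != 0)
            continue;

        ++*count; /* the entry itself is still reported */
        if (S_ISDIR(st.st_mode) && !is_dataless(&st))
            (void)walk_skipping_dataless(path, count);
    }
    closedir(d);
    return 0;
}
```

This would give roughly the bash/zsh result: SPM shows up, its children do not, until something else materializes it.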

faho avatar Nov 04 '21 08:11 faho

Yeah - but I would like to better understand what's going on first.

floam avatar Nov 04 '21 08:11 floam

Nice find! Can files be dataless or only directories? (On Windows, only files can be "dataless")

EDIT: going through the sources, both may be dataless. So the question is if we're currently only materializing the directories or both?

mqudsi avatar Nov 05 '21 17:11 mqudsi

Only directories.

floam avatar Nov 05 '21 18:11 floam

(which doesn't explain my empty hard disk: perhaps a coincidence.)

But it certifiably makes globs… really slow.

floam avatar Nov 05 '21 18:11 floam

Okay, I'm going to close this one as "not planned" since it appears to not be happening.

It looks like it's clear what would have to be done, but someone with a macOS system needs to try it out and report back.

faho avatar Jan 30 '24 10:01 faho

For anyone who comes across this old issue, this link might be useful:

https://developer.apple.com/documentation/technotes/tn3150-getting-ready-for-data-less-files#Understand-the-impact-of-file-materialization

latenitefilms avatar Jul 01 '24 12:07 latenitefilms