filesystem-reporting-tools
add -x flag from du
Implement the -x flag from du: -x "Skip directories that are on different filesystems from the one that the argument being processed is on." Crossing file systems and/or mount points can give very inaccurate information if you are monitoring volume growth. I will use the st_dev field from stat to detect this.
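The real change would live in pwalk's walker itself; the snippet below is only a minimal sketch of the st_dev idea (shown in Python for brevity, with an illustrative helper name): record the device of the starting argument and prune any directory whose device differs.

```python
import os

def walk_one_filesystem(root):
    """Yield (directory, filenames) under root, skipping directories whose
    st_dev differs from the starting argument -- the behaviour the proposed
    -x flag would add.  Illustrative sketch, not pwalk's actual code."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune subdirectories that live on a different filesystem / mount.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        yield dirpath, filenames
```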
I have a question about this planned option, @fizwit.
Is it true that "directories that are on different filesystems from the one that the argument being processed is on" can only arise when they are symbolic links?
And if so, then this should never happen since symbolic links are never followed.
I am having new considerations for issues such as symbolic links and filesystems / st_dev, as we have recently adopted Komprise, which migrates files that are both "old" and "large" to secondary storage (in my case) and leaves a symbolic link in its place.
I would like pwalk to optionally be able to traverse such symbolic links BUT NOT OTHERS.
Would you consider extending pwalk to my usecase?
For example, I note that du supports -D:
-D, --dereference-args
dereference only symlinks that are listed on the command line
With a little creative extension, we could overload -D for pwalk and allow it to additionally dereference symlinks whose target is within a specified location, given on the command line with a leading '->'.
For example:
pwalk -D ->/my/secondary/storage /stuff/to/report/on
would only dereference symlinks that were created by Komprise when it moved a file to /my/secondary/storage.
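Since this is only a proposal, the helper below is hypothetical: it dereferences a symlink only when its resolved target falls under the location supplied after '->' (e.g. /my/secondary/storage above), and treats every other symlink as the link itself.

```python
import os

def maybe_dereference(path, allowed_prefix):
    """Follow the symlink at path only if its target resolves inside
    allowed_prefix (assumed absolute, e.g. the Komprise secondary storage);
    otherwise report the link itself.  Hypothetical sketch of the proposed
    -D extension, not existing pwalk behaviour."""
    if not os.path.islink(path):
        return os.lstat(path)
    target = os.path.realpath(path)
    if os.path.commonpath([target, allowed_prefix]) == allowed_prefix:
        return os.stat(path)   # stat the file living on secondary storage
    return os.lstat(path)      # stat the link itself, as pwalk does today
```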
@bmcgough - I am interested in whether you have such considerations at FH, and if so, how you address them.
It is philosophically incorrect to report storage from two different devices. The UNIX command df reports volume sizes, and pwalk reports file sizes by volume. If your hierarchical storage management system is migrating files to a secondary volume, then you should run a separate report on that volume. The sum of your storage would be your primary and secondary reports combined. I would think that knowing the size of your primary and secondary volumes would be valuable information.
Since symlinks are part of the primary volume, I report on the symbolic link itself. If I were to support HSM, I would only report on the one file pointed to by the symbolic link, using the inode of the symlink as the parent inode. I'm not sure how I would store the inode of the remote system: I have used the inode as a primary key with databases, and an inode from a remote system would not be guaranteed to be unique. I could use negative one as the inode to signify that the file is from a foreign volume, but then you would not be able to use inode as a primary key. I might call the feature --HSM since it would be a very different function than du's -D.
+----------------------+          +------------+
|                      |          |            |
| Volume One           |          | Volume Two |
| stub file - Sym Link -----------> file       |
|                      |          |            |
+----------------------+          +------------+
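A rough sketch of how such a stub-file record could be assembled under the HSM idea above; the field names, the -1 sentinel, and the helper are illustrative only, not existing pwalk behaviour.

```python
import os

FOREIGN_INODE = -1  # sentinel: the file lives on a foreign (secondary) volume

def hsm_record(link_path):
    """Describe the single file a stub symlink points to, using the symlink's
    own inode as the parent inode and -1 as the file's inode, as discussed
    above.  Illustrative sketch only."""
    link_st = os.lstat(link_path)    # the stub on the primary volume
    target_st = os.stat(link_path)   # follows the link to the secondary volume
    return {
        "parent_inode": link_st.st_ino,  # symlink's inode acts as parent
        "inode": FOREIGN_INODE,          # not unique across volumes, so flagged
        "size": target_st.st_size,       # size tallied from secondary storage
    }
```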
Thanks @fizwit - I agree with your argument.
Except - warning - if there are hardlinks between files on a given volume, they will have the same inode and mess up your use of inode as primary key in a database. Careful there!
I will ruminate on how I want to treat my stub files going forward. In some applications, I will want to tally their size on the secondary storage. If there were a quick way to identify all symbolic links from the output of pwalk, I could rather easily post-process my initial scan, and test which of the symbolic links point to secondary storage. On the other hand I could plan to run pwalk on the secondary storage and merge the results.
This is my problem. Thanks for your consideration.
Yes, hardlinks have been a problem for us. Right now we actually treat the full file name as our primary key, so we capture hardlinks. But you have to take this into account when calculating storage space and only tally based on inode (st_nlink might be useful here).
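For example, a minimal sketch of that tallying step, assuming each record carries st_dev, st_ino, and st_size fields (the exact pwalk column names may differ): hardlinked names share an inode, so each (st_dev, st_ino) pair is counted once.

```python
def tally_bytes(records):
    """Sum file sizes while counting each hardlinked inode only once.
    Each record is assumed to be a dict with st_dev, st_ino and st_size."""
    seen = set()
    total = 0
    for rec in records:
        key = (rec["st_dev"], rec["st_ino"])
        if key in seen:
            continue          # another name for an inode already counted
        seen.add(key)
        total += rec["st_size"]
    return total
```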
You can tell symlinks from the pwalk output (ex: Python https://docs.python.org/3/library/stat.html#stat.S_ISDIR), but not what they target. That would involve reading the symlink, which is the same as reading files, which pwalk does not do at this time.
You could provide this mapping yourself with minimal additional crawling by extracting the symlinks from the pwalk output on the primary file system, reading those links, and cross-referencing them against the pwalk output from your secondary file system.
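A hedged sketch of that post-processing step, assuming the pwalk CSV exposes the raw st_mode and the file name (the column positions below are guesses and would need to match your actual output): pick out the symlinks with stat.S_ISLNK, resolve each link, and keep only those pointing into the secondary storage tree.

```python
import csv
import os
import stat

# Column positions are assumptions about the local pwalk output layout.
FILENAME_COL = 3
MODE_COL = 9

def secondary_storage_links(pwalk_csv, secondary_prefix):
    """From a pwalk CSV of the primary filesystem, yield (symlink, target)
    pairs for links whose target resolves under secondary_prefix."""
    with open(pwalk_csv, newline="", errors="surrogateescape") as fh:
        for row in csv.reader(fh):
            try:
                mode = int(row[MODE_COL])
            except (IndexError, ValueError):
                continue                      # malformed row
            if not stat.S_ISLNK(mode):
                continue                      # not a symlink
            name = row[FILENAME_COL]
            target = os.path.realpath(name)   # the extra crawl: resolve the link
            if target.startswith(secondary_prefix.rstrip("/") + "/"):
                yield name, target
```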
We are using PostgreSQL to merge our parallel pwalks. It is not real-time fast, but seems to be reliable and allows for relatively complex relational queries. Elasticsearch is real-time fast, but has limited relational and hierarchy capabilities. But querying across two indices is not complex.
Honestly, the hardest part of all this is creating a byte-clean pipeline: we have filenames containing bytes that correspond to no defined character set, so even good Unicode support does not guarantee resiliency, and this is only going to get worse in the future.
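For what it's worth, one way to keep a Python stage of such a pipeline byte-clean is the surrogateescape error handler, which round-trips undecodable bytes instead of dropping or mangling them; a minimal illustration:

```python
import os

# A filename containing bytes that are valid in no character set.
raw = b"report-\xff\xfe-final.txt"

# os.fsdecode uses the surrogateescape handler, so the undecodable bytes
# survive as lone surrogates in the resulting str ...
name = os.fsdecode(raw)

# ... and os.fsencode turns them back into exactly the original bytes,
# so the round trip through text-based tooling is lossless.
assert os.fsencode(name) == raw
```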
I too am loading PostgreSQL (v11) with pwalk output as outlined here, which I just amended to detail some experience with ltree encoding of filenames and jsonb storage of metadata.
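Since ltree labels only allow a restricted character set, path components have to be encoded before they can form an ltree path; the sketch below shows one possible scheme (hex-encoding each component) and is not necessarily the encoding used in the write-up mentioned above.

```python
import binascii

def path_to_ltree(path):
    """Encode a filesystem path as a PostgreSQL ltree label path by
    hex-encoding each component, since ltree labels only permit
    [A-Za-z0-9_].  One possible scheme, shown for illustration."""
    parts = [p for p in path.split("/") if p]
    return ".".join(
        "x" + binascii.hexlify(p.encode("utf-8", "surrogateescape")).decode("ascii")
        for p in parts
    )

# Example: '/data/projects/a b.txt' -> 'x64617461.x70726f6a65637473.x6120622e747874'
```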
@bmcgough - I would be very interested if you would similarly update the Ben McGough approach, especially if you made any progress on your "next steps".