parallel-disk-usage
Feature Request: Filter files by file extension or regex
Referencing questions raised here:
Do you only filter files? What about directories?
I would like to get size information only for files matching one or more provided extensions, where those files are spread throughout multiple nested directories alongside other content whose file size I want to ignore.
My primary interest is total file size by extension. Additional metrics, like a breakdown per directory or the locations of the largest files (or the largest directories of this content), are nice-to-haves.
Some tools provide ways to exclude files or directories, that can still be understandable and desirable in this context.
What is the syntax?
I'm not familiar with this tool's CLI syntax, as it's not documented and I've not yet downloaded/installed it to try it out.
I have used dutree with its --aggr=500M option on directories containing only the content I am interested in (filtering is only useful there for a breakdown of different extensions, e.g. all image content, where I want to know how much is PNG and what the top 10 sizes/paths are).
dutree also lacks filtering support. It does have an exclusion flag -x, but it requires repeating the flag for every path you want to exclude, rather than accepting a single string of delimited values.
In some software like Caddy I use a regex pattern (\.(jpeg|jpg|gif|png|webp|avif|svg)$) for caching requests; it also allows specifying multiple domains with a comma delimiter (example.com,www.example.com,example.org). Anything like that as a value to some arg like --regex/--match/--filter-by/--pattern/etc. would be good.
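Until something like that exists, the same regex can be driven through GNU find as a stopgap. A minimal sketch, assuming GNU find and coreutils; the /tmp/pdu-demo paths and file sizes are made up for illustration:

```shell
# Workaround sketch: sum the sizes of files matching the image-extension
# regex from above, using GNU find's -regex (which matches the whole path).
mkdir -p /tmp/pdu-demo/sub
truncate -s 1024 /tmp/pdu-demo/a.png      # counted
truncate -s 2048 /tmp/pdu-demo/sub/b.jpg  # counted
truncate -s 4096 /tmp/pdu-demo/c.txt      # ignored

find /tmp/pdu-demo -type f \
  -regextype posix-extended -regex '.*\.(jpeg|jpg|gif|png|webp|avif|svg)' \
  -printf '%s\n' | awk '{ total += $1 } END { print total }'   # prints 3072
```

Note that find's -regex is implicitly anchored to the whole path, so no trailing $ is needed.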
For the interactive TUI, some users may find it convenient to filter results interactively instead.
I'm personally just interested in the ability to filter by file extension. Presently I am using the following shell script:
# Print "size filename" so awk's $NF is the filename's last word,
# which keeps the extension intact even for names containing spaces.
find . -type f -printf "%s %f\n" |
awk '{
  # extension = text after the last "." (or NULL if there is no dot)
  n = split($NF, parts, ".");
  ext = (n == 1) ? "NULL" : parts[n];
  bytes[ext] += $1
}
END {
  for (ext in bytes) {
    print bytes[ext], ext;
  }
}' | sort -n | numfmt --field=1 --to=iec-i --format "%8f" --suffix B
Which outputs results like this:
9.5MiB css
19MiB psd
22MiB json
24MiB md
75MiB jpeg
158MiB js
174MiB php
228MiB webp
2.4GiB gif
4.8GiB bsp
4.8GiB pdmod
12GiB jpg
15GiB 7z
16GiB png
80GiB rar
97GiB zip
That's a tad limited in output compared with what the nicer disk-usage CLI tools offer, and it isn't filtering by specific extensions; but for the given output that's not a concern (I can easily identify the extensions I am interested in and their total sizes, with no further detailed breakdown to sift through).
Current tools allow me to exclude dirs or to scan specific dirs. If the file content is mixed, however, that limits the usefulness of insights beyond an overview of top file/dir sizes (or aggregated dir sizes).
Perhaps one other option would be to use ripgrep to collect all file paths matching a regex pattern, then pass that list to the CLI to derive insights?
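For what it's worth, that two-step approach can already be sketched with stock tools. This assumes ripgrep and GNU du are installed; the glob set is just an example:

```shell
# Sketch: ripgrep lists files matching a glob set; GNU du reads the
# NUL-delimited list from stdin and prints a grand total (-c).
if command -v rg >/dev/null 2>&1; then
  rg --files --iglob '*.{jpeg,jpg,gif,png,webp,avif,svg}' . \
    | tr '\n' '\0' \
    | du --files0-from=- -ch | tail -n 1
fi
```

The same du pipeline works with any path producer (find -print0, etc.), so the matching step and the summing step stay independent.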
Is there any reason you prefer regex to glob pattern?
No problem. Sometimes glob pattern support/syntax varies; as long as it's clearly documented and simple, I'm not fussed.
I would just like a way to query/filter disk usage for files of one or more given extensions while ignoring other content, if possible.
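As a stopgap, the awk summary earlier in the thread can be restricted to a chosen extension set. A sketch; the EXTS list is illustrative:

```shell
# Sketch: per-extension size totals, limited to extensions named in EXTS.
EXTS='png|jpg|jpeg'   # illustrative; adjust to taste
find . -type f -printf '%s %f\n' |
awk -v pat="^(${EXTS})\$" '{
  # $NF = filename (last field), so names with spaces still yield the extension
  n = split($NF, parts, ".");
  ext = (n == 1) ? "NULL" : parts[n];
  if (ext ~ pat) bytes[ext] += $1
}
END { for (e in bytes) print bytes[e], e }' | sort -n
```

Passing the pattern in with -v keeps the extension set a single delimited string, which is exactly the CLI ergonomics asked for above.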