
Feature Request: Filter files by file extension or regex

Open polarathene opened this issue 3 years ago • 3 comments

Referencing questions raised here:

Do you only filter files? What about directories?

I would like to get size information only for files matching one or more provided extensions, where those files are spread throughout multiple nested directories mixed with other content whose file size I want to ignore.

In my case, my primary interest is total file size by extension. Additional metrics like a breakdown per directory or the locations of the largest files (or the largest directories of this content) are nice-to-haves.

Some tools provide ways to exclude files or directories; that can still be understandable and desirable in this context.


What is the syntax?

I'm not familiar with this tool's CLI syntax, as it's not documented and I've not yet downloaded/installed it to try it out.

I have used dutree with its --aggr=500M option on directories containing only the content I am specifically interested in (filtering would only be useful there for a breakdown by extension, e.g. all image content where I want to know how much is PNG, or what the top 10 sizes/paths are).

dutree also lacks filtering support. It does have an exclusion syntax, -x, but it requires repeating the flag for every path you want to exclude, rather than accepting a single string of delimited values; see the comparison below.
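
For comparison, a sketch of what that looks like (going from memory of dutree's --help, so treat the exact flags as unverified; the single-flag alternative is a hypothetical tool):

# each excluded name needs its own -x flag:
dutree -x node_modules -x target -x .git ./some/dir
# versus a single delimited value, which is what I'd prefer:
# some-tool --exclude=node_modules,target,.git ./some/dir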

In some software like Caddy I use a regex pattern (\.(jpeg|jpg|gif|png|webp|avif|svg)$) for caching requests; it also allows specifying multiple domains with a comma delimiter (example.com,www.example.com,example.org). Anything like that as a value to some arg like --regex/--match/--filter-by/--pattern/etc. would be good.
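
To sketch what I mean (the flag names here are hypothetical; whatever fits the existing CLI conventions is fine):

# hypothetical: only count files whose names match the regex
pdu --pattern='\.(jpeg|jpg|gif|png|webp|avif|svg)$' ./media
# hypothetical: a simpler comma-delimited extension list
pdu --filter-by=jpeg,jpg,gif,png,webp,avif,svg ./media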

For an interactive TUI, some users may find it convenient to filter results interactively instead.


I'm personally just interested in the ability to filter by file extension. Presently I am using the following shell script:

find . -type f -printf "%f %s\n" |
  awk '{
      # Split the filename on "." to find its extension;
      # names without a "." are bucketed under "NULL".
      # (Assumes filenames without spaces, since find prints "name size".)
      PARTSCOUNT = split( $1, FILEPARTS, "." );
      EXTENSION = PARTSCOUNT == 1 ? "NULL" : FILEPARTS[PARTSCOUNT];
      # Accumulate byte sizes per extension:
      FILETYPE_MAP[EXTENSION] += $2
    }
    END {
      for( FILETYPE in FILETYPE_MAP ) {
        print FILETYPE_MAP[FILETYPE], FILETYPE;
      }
    }' | sort -n | numfmt --field=1 --to=iec-i --format "%8f" --suffix B

Which outputs results like this:

  9.5MiB css
   19MiB psd
   22MiB json
   24MiB md
   75MiB jpeg
  158MiB js
  174MiB php
  228MiB webp
  2.4GiB gif
  4.8GiB bsp
  4.8GiB pdmod
   12GiB jpg
   15GiB 7z
   16GiB png
   80GiB rar
   97GiB zip

That's a tad limited in output and in what can easily be done with it versus what the nicer disk-usage CLI tools offer, and it doesn't actually filter by specific extensions; for the output above that's not a concern (I can easily identify the extensions I am interested in and their summed size, with no further detailed information/breakdown to sift through).

Current tools allow me to exclude dirs or scan specific dirs. If the file content is mixed, however, that limits the usefulness of the insights to an overview of top file/dir sizes (or aggregated dir sizes).


Related feature request for dua

polarathene avatar Aug 10 '21 07:08 polarathene

Perhaps one other option would be to use ripgrep to collect all file paths matching a pattern, then pass those to the CLI to derive insights?
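
Something roughly like this (assuming pdu accepts multiple file paths as positional arguments, which I haven't verified):

# list matching files with ripgrep, NUL-delimited to survive spaces in paths,
# then hand the paths to pdu:
rg --files --iglob '*.{jpeg,jpg,gif,png,webp,avif,svg}' --null | xargs -0 pdu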

polarathene avatar Aug 10 '21 07:08 polarathene

Is there any reason you prefer regex to glob pattern?

KSXGitHub avatar Aug 10 '21 07:08 KSXGitHub

Is there any reason you prefer regex to glob pattern?

No strong reason; glob pattern support/syntax sometimes varies between tools. As long as it's clearly documented and simple, I'm not fussed.

I would just like to have a way to query/filter for disk usage of files of one or more given extensions while ignoring other content if possible.
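
e.g. a glob-based equivalent would work just as well for my case (hypothetical flag name again):

# hypothetical: glob filter instead of regex
pdu --glob='**/*.{png,jpg,webp}' ./media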

polarathene avatar Aug 10 '21 11:08 polarathene