Feature request: compsize --find

Open Forza-tng opened this issue 3 years ago • 3 comments

I'd like an option to use compsize to find files with extents matching some criteria and list them.

compsize --find could have the following matches:

  • compression type; e.g. zlib, lzo, zstd
  • extents larger than X
  • extents smaller than X
  • average extent size smaller than X
  • average extent size larger than X
  • has at least X shared/deduped extents/blocks
  • has no shared extents
  • has slack larger than X
  • has more than X extents
  • has fewer than X extents
  • files larger than X bytes (kiB, MiB, GiB, ...)
  • files smaller than X bytes
  • files that are inlined in metadata
  • files that are not inlined in metadata
  • files with more than X non-contiguous extents

It should be possible to combine several matches.

The output can be a table with the path/filename + the chosen matches.

An option to sort the output would be great, or an output format that can be piped to sort.
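To make the request concrete, here is a minimal sketch of how combined matches and sortable output could behave. It is not compsize code: the `FileStats` record, its field names, and the helper functions are all hypothetical, and the sample numbers are made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FileStats:
    # Hypothetical per-file summary; the field names are illustrative only.
    path: str
    size: int          # file size in bytes
    extents: int       # number of extents
    compression: str   # "none", "zlib", "lzo" or "zstd"

    @property
    def avg_extent_size(self) -> float:
        return self.size / self.extents if self.extents else 0.0


Match = Callable[[FileStats], bool]


def more_extents_than(n: int) -> Match:
    return lambda f: f.extents > n


def avg_extent_smaller_than(n: int) -> Match:
    return lambda f: f.extents > 0 and f.avg_extent_size < n


def find(files: Iterable[FileStats], matches: List[Match]) -> List[FileStats]:
    # A file is listed only if every requested match holds (logical AND).
    return [f for f in files if all(m(f) for m in matches)]


def format_rows(files: Iterable[FileStats]) -> List[str]:
    # Tab-separated with numeric columns first, so the output can be piped to `sort -n`.
    return [f"{f.extents}\t{int(f.avg_extent_size)}\t{f.path}" for f in files]


# Made-up sample data.
sample = [
    FileStats("a.img", 1 << 30, 5000, "none"),
    FileStats("b.log", 1 << 20, 2, "zstd"),
    FileStats("c.db", 1 << 26, 900, "zstd"),
]
hits = find(sample, [more_extents_than(100), avg_extent_smaller_than(256 * 1024)])
for row in format_rows(hits):
    print(row)
```

Putting the numeric columns before the path keeps the output easy to sort even when file names contain spaces.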

My initial use case is to find highly fragmented files so that I can defragment them manually. I would also like to analyse how my files are laid out, to determine whether I should take some action on them.

Jul 08 '22 13:07 Forza-tng

Any of the numbers other than shared extents are easy to get; shared extents would require two passes as we don't know yet about other files that are yet to be processed.

On the other hand, designing and implementing a reasonable interface is not trivial.
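The two-pass idea can be sketched as follows. This only models sharing among the files actually processed, and the input format (a mapping from path to a list of extent addresses) is an assumption made for illustration, not something compsize exposes.

```python
from collections import Counter


def shared_within_set(file_extents):
    """Count, per file, the extents that also appear in another processed file.

    file_extents: {path: [extent address, ...]} (hypothetical input format).
    """
    # Pass 1: for each extent, count how many distinct files reference it.
    seen = Counter()
    for extents in file_extents.values():
        seen.update(set(extents))

    # Pass 2: only now can each file be compared against all the others.
    return {
        path: sum(1 for e in set(extents) if seen[e] > 1)
        for path, extents in file_extents.items()
    }


# Made-up addresses: foo and bar share extent 100.
example = {"foo": [100, 200], "bar": [100], "baz": [300]}
print(shared_within_set(example))
```

In this simplified model, an extent referenced twice by the same file does not count as shared.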

Jul 09 '22 13:07 kilobyte

> Any of the numbers other than shared extents are easy to get; shared extents would require two passes as we don't know yet about other files that are yet to be processed.
>
> On the other hand, designing and implementing a reasonable interface is not trivial.

Don't we get shared extents today, at least within the search path? (the referenced vs actual usage)

# compsize home/
Processed 11696 files, 5763 regular extents (21001 refs), 1134 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       56%       89M         158M         471M
none       100%       58M          58M         175M
zstd        30%       30M          99M         296M

Jul 09 '22 13:07 Forza-tng

"Referenced" counts the number of times each reference is seen in the files named on the command line. It doesn't tell you whether the underlying extents are shared.

It doesn't handle hard links, and merely repeating a file name gets all of its references "shared":

# compsize foo
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      9.6M         9.6M         9.6M       
none       100%      9.6M         9.6M         9.6M       
# compsize foo foo
Processed 2 files, 1 regular extents (2 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      9.6M         9.6M          19M       
none       100%      9.6M         9.6M          19M       

An extent will appear to be unshared if you didn't provide files containing all of the references on the command line:

# cp --reflink=always foo bar
# compsize foo
Processed 1 file, 1 regular extents (1 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      9.6M         9.6M         9.6M       
none       100%      9.6M         9.6M         9.6M       
# compsize foo bar
Processed 2 files, 1 regular extents (2 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      9.6M         9.6M          19M       
none       100%      9.6M         9.6M          19M       
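The transcripts above are consistent with a simple accounting model, sketched here as an assumption rather than taken from the compsize source: disk usage counts each distinct extent once, while "Referenced" adds up every reference seen in the named files. The extent ids and integer sizes are made up.

```python
def summarize(args, extent_map):
    """Return (disk_usage, referenced) for the paths in args.

    args: paths as given on the command line; repeats are allowed.
    extent_map: {path: [(extent_id, length), ...]} (hypothetical layout).
    """
    disk = {}        # extent_id -> length, each distinct extent counted once
    referenced = 0   # every reference seen, counted each time
    for path in args:
        for extent_id, length in extent_map[path]:
            disk[extent_id] = length
            referenced += length
    return sum(disk.values()), referenced


# foo and bar are reflinked copies: both reference the same extent.
layout = {"foo": [(1, 10)], "bar": [(1, 10)]}

print(summarize(["foo"], layout))          # extent looks unshared
print(summarize(["foo", "foo"], layout))   # repeated name inflates Referenced
print(summarize(["foo", "bar"], layout))   # a real reflink gives the same totals
```

In this model a repeated file name and a genuine reflink produce the same totals, so the totals alone cannot show whether an extent is really shared.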

If we want to know whether an extent is shared in general, we have to look at the backrefs and reference counter for the extent in the extent tree (if the count is >1, we know immediately that it is shared), then work backwards up the subvol tree to see whether there are multiple roots in its ancestry (we can stop as soon as we find a second parent). The first step is an easy TREE_SEARCH on the extent tree. The second step is expensive. The choices are to use the LOGICAL_INO ioctl, which will find every reference and so does more work than is needed to calculate shared/not-shared; or to read the block device directly, which adds significant extra complexity and some race conditions that will need to be handled somehow (accept lower accuracy or run a retry loop).

That said, running LOGICAL_INO on each unique extent would do the job; it will just take more time and IO than compsize would normally use.
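The decision procedure can be sketched as below. This is a model only: `list_references` is a hypothetical callable standing in for a LOGICAL_INO lookup, and no real ioctl is issued.

```python
def is_shared(refcount, list_references):
    """Decide whether an extent is shared.

    refcount: reference counter read from the extent item (the cheap step).
    list_references: hypothetical callable returning every
        (root, inode, offset) reference to the extent; it stands in for
        the expensive LOGICAL_INO lookup.
    """
    # Cheap path: more than one reference in the extent tree settles it.
    if refcount > 1:
        return True
    # Expensive path: resolve all references and see if there is more than one.
    return len(set(list_references())) > 1


def must_not_be_called():
    raise AssertionError("expensive lookup should have been skipped")


print(is_shared(2, must_not_be_called))
print(is_shared(1, lambda: [(5, 257, 0)]))
print(is_shared(1, lambda: [(5, 257, 0), (256, 257, 0)]))
```

The last call models an extent with a reference count of 1 that is still reachable from two roots, which is the case the cheap check alone would miss.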

Jul 09 '22 15:07 Zygo