btdu
Feature suggestion: metric option when outputting in "du format".
Feature suggestion: be able to choose the metric when running with the --du flag (or perhaps output all four metrics at once, in four columns).
Often I am more interested in the distributed, shared, and exclusive metrics, or in comparing them with each other. (I have deduplicated the data with Bees.)
I also like the --du flag, since you can then sort the output, e.g. btdu-static-051-x86_64 --max-time=5m --du --headless /mnt | sort -n | tail -1000, to get a good overview of which directories and files take the most space (even when they are under different top-level directories).
Well, --du is really only there for compatibility with tools that consume du output, such as xdu/xdiskusage. For anything else, I would suggest using --export and e.g. a jq script.
Ah, okay, I need to figure out how to accomplish the same as | sort -n | tail -1000 with jq and the exported JSON file...
As a starting point, here is a jq script which should produce roughly the same results as --du:
.totalSize as $totalSize |
.root.data.represented.samples as $totalSamples |
def descend(prefix):
  (prefix + "/" + .name) as $fullName |
  (.data.represented.samples
   | if . == null then 0 else . end
   | . / $totalSamples * $totalSize
   | . / 1024 # convert to 1K blocks
   | round
  ) as $size |
  "\($size)\t\($fullName)",
  (.children[] | descend($fullName));
.root
| descend("")
Run jq with -r.
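For example, saving the script above as (say) btdu.jq, the equivalent of the earlier du pipeline would be something like:
jq -r --from-file btdu.jq btdu-export.json | sort -n | tail -1000
(-r is short for --raw-output; the filenames here are just examples.)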
Hmmm... Thanks! Replacing .data.represented.samples with .data.exclusive.samples, .data.shared.samples, or .data.distributedSamples in your script seems to give the exclusive, shared, and distributed metric data respectively.
So, with the help of your suggestion, I created three (slightly) different jq script files:
btdu-exclusive.jq:
.totalSize as $totalSize |
.root.data.represented.samples as $totalSamples |
def descend(prefix):
  (prefix + "/" + .name) as $fullName |
  (.data.exclusive.samples
   | if . == null then 0 else . end
   | . / $totalSamples * $totalSize
   | . / 1024 # convert to 1K blocks
   | round
  ) as $size |
  "\($size)\t\($fullName)",
  (.children[] | descend($fullName));
.root
| descend("")
btdu-distributed.jq:
.totalSize as $totalSize |
.root.data.represented.samples as $totalSamples |
def descend(prefix):
  (prefix + "/" + .name) as $fullName |
  (.data.distributedSamples
   | if . == null then 0 else . end
   | . / $totalSamples * $totalSize
   | . / 1024 # convert to 1K blocks
   | round
  ) as $size |
  "\($size)\t\($fullName)",
  (.children[] | descend($fullName));
.root
| descend("")
btdu-shared.jq:
.totalSize as $totalSize |
.root.data.represented.samples as $totalSamples |
def descend(prefix):
  (prefix + "/" + .name) as $fullName |
  (.data.shared.samples
   | if . == null then 0 else . end
   | . / $totalSamples * $totalSize
   | . / 1024 # convert to 1K blocks
   | round
  ) as $size |
  "\($size)\t\($fullName)",
  (.children[] | descend($fullName));
.root
| descend("")
If I then run, for example, jq --from-file btdu-distributed.jq --raw-output btdu-export.json | sort -n | tail -1000 on an exported JSON file (replacing the script with btdu-exclusive.jq or btdu-shared.jq to instead get the exclusive or shared sizes respectively), I think I get what I want.
(I haven't carefully checked whether the numbers are calculated correctly in the script for the metrics other than represented.)
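In principle, the three scripts could also be collapsed into a single one that takes the metric name via jq's --arg. Here is a rough, untested sketch along those lines (btdu-metric.jq is a hypothetical name; it assumes, following the fields above, that represented, exclusive and shared are objects with a samples field, while distributedSamples is a plain number):
# btdu-metric.jq (hypothetical); $metric is one of:
# "represented", "exclusive", "shared", "distributedSamples"
.totalSize as $totalSize |
.root.data.represented.samples as $totalSamples |
def metricSamples:
  .data[$metric]
  # Object-valued metrics carry the count in .samples;
  # distributedSamples is already a plain number
  | if type == "object" then .samples else . end
  | if . == null then 0 else . end;
def descend(prefix):
  (prefix + "/" + .name) as $fullName |
  (metricSamples
   | . / $totalSamples * $totalSize
   | . / 1024 # convert to 1K blocks
   | round
  ) as $size |
  "\($size)\t\($fullName)",
  (.children[] | descend($fullName));
.root
| descend("")
It could then be run as, e.g., jq --raw-output --arg metric exclusive --from-file btdu-metric.jq btdu-export.json | sort -n | tail -1000.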
With an exported JSON file of size 2.1 GiB, the jq command consumes about 20-30 GiB of RAM on my computer.
That is a lot.
I think a more efficient solution should be possible with --stream.
Here is a jq --stream script:
def updateState(item):
  item as [$jsonPath, $value] |
  # Start with previous state as base
  # Clear output
  . * {"output": []} |
  # Process entry
  . * (
    # Globals
    if $jsonPath == ["fsPath"]
    then {"path": [$value]}
    elif $jsonPath == ["totalSize"]
    then {"totalSize": $value}
    elif $jsonPath == ["root", "data", "represented", "samples"]
    then {"totalSamples": $value}
    # Name
    elif $jsonPath[-1] == "name"
    then {"path": (.path[:($jsonPath | length | (. - 1) / 3)] + [$value])}
    # Size
    elif $jsonPath[-3:] == ["data", "represented", "samples"]
    then
      . as $state |
      ($state.path | join("/")) as $fullName |
      ($value
       | . / $state.totalSamples * $state.totalSize
       | . / 1024 # convert to 1K blocks
       | round
      ) as $size |
      {"output": ["\($size)\t\($fullName)"]}
    else {
      # "output": ["Unrecognized line: \(item), state: \(.)"]
    }
    end
  );

{
  "path": [],
  "output": []
} as $initState |
foreach inputs as $item ($initState; updateState($item); .output[])
This version uses a constant amount of memory.
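One note on invocation: since the script reads its data via inputs, jq should normally also be given -n (--null-input), so that the first stream event is not consumed as the program's initial input. Assuming the script is saved as btdu-stream.jq (the name is just an example), that would be something like:
jq -rn --stream --from-file btdu-stream.jq btdu-export.json | sort -n | tail -1000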
Thanks! I tried your last script with the elif $jsonPath[-3:] == ["data", "represented", "samples"] line replaced with elif $jsonPath[-2:] == ["data", "distributedSamples"], and ran it on the same 2.1 GiB exported file like this:
jq --stream --from-file btdu-stream-distributed.jq --raw-output btdu-export.json | sort -n | tail -1000.
The jq process now consumes less than 1 MiB of RAM, but it still takes about 10 minutes to run (in addition to the time previously spent running btdu to produce the btdu-export.json file). The previous non-stream script also took almost as long to run, due to the memory swapping to my NVMe disk.
I need to run more tests with different btdu running times, and with SSDs versus rotating hard drives. If I run btdu for, e.g., one hour, an additional 10 minutes for the jq script would not be that bad. I still need to verify, though, that the jq running time does not grow much longer for longer btdu runs.