Memory allocation issue when running on Windows
Hi.
I'm trying to count files in a 30M-file dataset on an SMB share. Can anything be done to overcome the error below, or have I reached the maximum scale of dust? Thanks!
```
.\dust -F -j -r -d 4 -n 100 -s 400000 -f \\server\share$\Groups
Indexing: \\server\share$\Groups 9949070 files, 9.5M ...
memory allocation of 262144 bytes failed
```
Just an update: running against the same repository from a Linux client completes successfully. I suspect this issue is relevant only to the Windows version.
Can you try running dust with more memory, e.g. `dust -S 1073741824`? `-S` lets you specify the stack size, so you can try increasing / decreasing the number and see if Windows sorts itself out.
```
C:\DUST>C:\DUST\dust.exe -S 1073741824 -D -p -j -r -f -n 100 -d 7 -z 200000 "\\srv\c$\folder"
Indexing: \\srv\c$\folder 12401021 files, 11M ...
memory allocation of 262144 bytes failed
```
I'm not sure I can do anything here. If Windows is failing to allocate enough memory to run dust, there may not be anything I can do.
I'd recommend repeatedly halving the number in -S, then repeatedly doubling it, and seeing if you can get a good run.
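If it helps, here is a quick sketch of that probe as a loop (purely illustrative: the path, depth, and stack sizes are placeholders, so substitute whatever you normally pass):

```bash
#!/bin/bash
# Try a ladder of -S (stack size) values and stop at the first run that
# completes without the allocation failure.
for stack in 134217728 268435456 536870912 1073741824 2147483648; do
    echo "trying: dust -S $stack"
    if dust -S "$stack" -d 4 -n 100 /path/to/share; then
        break
    fi
done
```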
I see the same thing on Linux on file systems with many millions of files.
I will try playing with -S, but as far as I can see it is a general scalability issue.
BTW, did you try it on file systems with 20-30 million files or more?
The same on Linux? Ok, let me try to recreate it on Linux.
Using these 2 scripts I made a large number of files on my ext4 filesystem:
```
cat ~/temp/many_files/make.sh
#! /bin/bash
for n in {1..1000}; do
    dd if=/dev/urandom of=file$( printf %03d "$n" ).bin bs=1 count=$(( RANDOM + 1024 ))
done
```
```
cat ~/temp/many_files/silly4/make.sh
#! /bin/bash
for n in {1..1000}; do
    mkdir $n
    touch $n/bspl{00001..09009}.$n
done
```
Gives:
```
(collapse)andy:(0):~/dev/rust/dust$ dust -f ~/temp/ -n 10
    99,003 ┌── many_small │█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   0%
   599,419 ├── many_small2│██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   1%
   900,982 ├── silly2     │███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   2%
   999,031 ├── silly      │████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   2%
 2,232,767 ├── silly3     │████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   5%
 9,009,001 ├── silly4     │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │  22%
 9,009,001 ├── silly5     │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │  22%
 9,009,001 ├── silly6     │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │  22%
 9,009,001 ├── silly7     │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │  22%
40,887,211 ┌─┴ many_files │████████████████████████████████████████████████████████████████████████████████ │ 100%
40,887,212 ┌─┴ temp       │████████████████████████████████████████████████████████████████████████████████ │ 100%
(collapse)andy:(0):~/dev/rust/dust$
```
I think by the time you are tracking a few tens of millions of files you are pushing the memory limits of the average system. htop certainly wasn't very happy when I ran the above.
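As a rough back-of-the-envelope illustration (the per-entry cost below is an assumption, not a measured figure from dust):

```bash
# Assume a few hundred bytes of bookkeeping (path string + metadata) per entry.
entries=40000000          # roughly the run above
bytes_per_entry=250       # assumed, not measured
echo "$(( entries * bytes_per_entry / 1024 / 1024 )) MiB"   # ~9500 MiB
```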
I ran a command identical to yours and it worked. In my use case there are a few differences that may be related:
- I use SMB or NFS to access the fs over the network
- My directory structure is more complex (can get deep and narrow)
- Directory and file names are long
Anyway, the servers I use have 32G of RAM and are doing nothing else. Is there any way I can use them to debug?
Thanks!
I'm not sure I can offer much more.
Adding '-d' doesn't make it use less memory.
I can only suggest cd-ing into a subdirectory so it has less data to trawl through.
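If it's any use, here is a sketch of doing that one subdirectory at a time (the path and flags are placeholders):

```bash
#!/bin/bash
# One dust invocation per top-level subdirectory, so no single run has to
# hold the whole tree in memory at once.
for d in /path/to/share/*/; do
    echo "== $d =="
    dust -d 3 -n 20 "$d"
done
```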
Thanks!
I will learn some Rust and do some debugging myself.
I will let you know if something pops up.
Hi.
I could easily reproduce the dust crash with the following script. One problem is the length of the file names and the number of sub-directories.
```bash
#!/bin/bash
BASE_DIR="/files"
NUM_DIRS=10000
NUM_FILES=10000
FILENAME_LENGTH=50

# Random alphanumeric string of the requested length.
generate_random_string() {
    local length=$1
    tr -dc A-Za-z0-9 </dev/urandom | head -c "$length"
}

# Recursively create NUM_DIRS directories per level, each containing NUM_FILES
# empty files with long random names, down to a depth of 10.
create_structure() {
    local current_depth=$1
    local current_dir=$2
    if [ "$current_depth" -gt 10 ]; then
        return
    fi
    for ((i = 0; i < NUM_DIRS; i++)); do
        dir_name=$(generate_random_string "$FILENAME_LENGTH")
        new_dir="$current_dir/$dir_name"
        mkdir -p "$new_dir"
        for ((j = 0; j < NUM_FILES; j++)); do
            file_name=$(generate_random_string "$FILENAME_LENGTH")
            touch "$new_dir/$file_name"
        done
        create_structure $((current_depth + 1)) "$new_dir"
    done
}

create_structure 1 "$BASE_DIR"
```
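A driver along these lines should reproduce it (`make_tree.sh` is a hypothetical name for the script above; the tree gets enormous, so use a scratch filesystem and stop the script once it is large enough):

```bash
bash make_tree.sh           # hypothetical filename for the script above
dust -d 4 -n 100 /files     # /files is BASE_DIR from the script
```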
Is that only on Windows?
I tried the above on my Linux box and dust handled it ok.
Not only on Windows. It also happens on Linux on a VM with 64G of RAM.
I have a 300TB volume on Linux with billions of files. It takes 30GB RSS + 170GB kmem and goes OOM for the container. I limited the depth to 3, so theoretically it could be done with memory proportional to the number of directories at depth 3 or less.
I am using `parallel du -hs ::: */*/*` instead and it works quite fine (the catch is that the workload is not balanced between processes, and the last, largest directory takes a long time).
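For reference, a rough sketch of that workaround with the results gathered and ranked afterwards (I use `du -s` rather than `-hs` here so the totals sort numerically; the glob depth and job count are just examples, and splitting one level deeper is one way to even out the per-process load):

```bash
# One du process per third-level directory, results ranked afterwards.
parallel -j 8 du -s ::: */*/* 2>/dev/null | sort -rn | head -n 50
```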
I don't think this is possible to fix. du runs and dumps its output as it goes; dust loads it all into memory to make a decision. If there is too much to load, dust will run out of memory.
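If streaming output is acceptable, something in that spirit is probably the pragmatic fallback for trees that large; a rough sketch with GNU du and sort (the path is a placeholder):

```bash
# du prints each directory total as it walks, and sort can spill to disk,
# so peak RAM stays modest even on very large trees.
du -B1 -d 4 /path/to/volume 2>/dev/null | sort -rn | head -n 100
```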