Hung?
Current status (running for the past 24 hrs)
(DRYRUN MODE) Now scanning "/nfs", found 6924758 files.
(DRYRUN MODE) Now have 6924758 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 75907323718686 bytes or 69 TiB
Removed 292875 files due to unique sizes from list.6631883 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 365136 files from list.6266747 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes removed 4229642 files from list.2037105 files left.
Now it's been sitting here for the last 20 hours with this message:
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum:
I get that calculating checksums is a more intensive process, and 2 million files is not exactly a walk in the park, but...
CPU utilization is holding at 5%, and memory utilization is no longer changing (it did change in the previous stages, while processing).
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2257015 5.0 1.8 2659852 2440212 pts/0 D+ Sep20 90:31 rdfind -n true /nfs
I'm also experiencing an rdfind instance stalled on, in my case, "eliminating candidates based on first bytes".
I didn't have that much patience, but I do see that it's not using any CPU or I/O.
Assuming it was hung, I tried to abort it, only to find that it doesn't react to any of Ctrl-Z, kill -STOP, kill -TERM or even kill -KILL...
The filesystem (an external NTFS drive) I'm trying to run rdfind on is still accessible normally, so whatever happened, it's not caused by or related to filesystem or device access in general.
I'll have to reboot now anyway to be able to unmount that external drive. I'll see whether the same thing happens again and hopefully be a bit better prepared for it (in the sense of trying to attach a debugger and figuring out where rdfind gets stuck).
... it happened again, this time a bit later during "eliminating candidates based on last bytes".
gdb also hangs when trying to attach to the stalled rdfind process. I see no smoking guns (no messages at all correlating with the time when rdfind stalled) in dmesg. A third-party pstack tool also hangs, with the slightly more valuable diagnostic of "LWP pid cannot be stopped: Operation not permitted".
ps shows that process in state D+, i.e. uninterruptible sleep. I'm running kernel 6.6.52 on x86_64.
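For anyone else hitting this: a process in state D is blocked inside a kernel call, and signals (generally including SIGKILL) are only acted on once that call returns, which would also explain why gdb cannot attach. On Linux, /proc/<pid>/stat and /proc/<pid>/wchan can still be read without stopping the process. Below is a minimal sketch of doing that; the helper names are hypothetical and not part of rdfind, and wchan may just show "0" depending on kernel configuration.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Extract the one-letter state from the contents of /proc/<pid>/stat.
// The format is "pid (comm) state ..."; comm may itself contain spaces
// or parentheses, so search for the LAST ')'. Returns '?' if malformed.
char parse_state(const std::string& stat_line) {
  const std::string::size_type close = stat_line.rfind(')');
  if (close == std::string::npos || close + 2 >= stat_line.size()) {
    return '?';
  }
  return stat_line[close + 2];
}

// Read the first line of a file; returns an empty string on failure.
std::string read_first_line(const std::string& path) {
  std::ifstream in(path);
  std::string line;
  if (in) {
    std::getline(in, line);
  }
  return line;
}

// Print the state and the kernel wait channel of a process (Linux only).
void report_process(const std::string& pid) {
  const std::string base = "/proc/" + pid + "/";
  std::cout << "state: " << parse_state(read_first_line(base + "stat")) << '\n';
  std::cout << "wchan: " << read_first_line(base + "wchan") << '\n';
}
```

Calling report_process("2257015") (the PID from the ps output above) would print the state letter and, if the kernel exposes it, the name of the kernel function the process is sleeping in.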
There is a comment in rdutil.h:
// if there is trouble with too much disk reading, sleeping for nsecsleep
// nanoseconds can be made between each file.
This is controlled by the -sleep command line option, e.g. -sleep 1ms. Unfortunately, the smallest sleep time is 1ms. The code uses nanoseconds internally, so it'd be nice if the sleep time could be reduced.
Given the behavior described, I wonder whether the sleep time became corrupted and went to infinity.
All the "stages" (first bytes / last bytes / etc) call the same underlying code in Fileinfo.cc, Fileinfo::fillwithbytes, which does:
std::fstream f1;
f1.open(m_filename.c_str(), std::ios_base::in);
Unfortunately the code doesn't check the stream status nor clear errors, which might result in a hang (e.g. as I found described here).
Also note the f1.open call does NOT open the file in binary mode!
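To illustrate the two points above, here is a minimal sketch of reading the first bytes of a file in binary mode while checking the stream state. read_first_bytes is a hypothetical helper, not rdfind's actual code, and its signature is my own assumption. Note that checks like these only help when the read returns with an error; they cannot help if the underlying read call never returns, as with a process stuck in state D.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not from rdfind): read up to `count` bytes from the
// start of `filename` into `out`.
// Returns false if the file cannot be opened or an I/O error occurs.
// On success, `out` holds the bytes actually read, which may be fewer than
// `count` when the file is shorter than requested.
bool read_first_bytes(const std::string& filename, std::size_t count,
                      std::vector<char>& out) {
  out.clear();
  // Open explicitly in binary mode so no newline translation takes place.
  std::ifstream f(filename, std::ios_base::in | std::ios_base::binary);
  if (!f.is_open()) {
    return false;
  }
  out.resize(count);
  f.read(out.data(), static_cast<std::streamsize>(count));
  if (f.bad()) {  // unrecoverable I/O error, as opposed to a short read
    out.clear();
    return false;
  }
  // A short read sets eofbit/failbit; gcount() reports what was extracted.
  out.resize(static_cast<std::size_t>(f.gcount()));
  return true;
}
```

A short read is treated as success here because files smaller than the requested byte count are a normal case for the first-bytes and last-bytes stages.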
I think this issue, and others complaining that the program hangs or is slow, comes down to read speed: even with a fast hashing algorithm, the bottleneck ends up being the read speed of the device the files are on. With HDDs on SATA, the maximum read speed I get is around 230 MB/s. With multiple TB of data this takes quite a while, and if your drives are USB-based it may be even slower.
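As a rough back-of-the-envelope check, here is a small sketch of that estimate. read_hours is a hypothetical helper, and I'm assuming "230 MB/s" means 230e6 bytes per second of sustained sequential throughput, which real workloads with many small files (or NFS) will likely not reach.

```cpp
#include <cstdint>

// Lower bound on the wall-clock hours needed to read `bytes` once,
// assuming a constant sequential throughput of `bytes_per_second`.
double read_hours(std::uint64_t bytes, double bytes_per_second) {
  return static_cast<double>(bytes) / bytes_per_second / 3600.0;
}
```

For the 75907323718686 bytes reported at the top of this thread, read_hours(75907323718686ULL, 230e6) comes out at roughly 92 hours. That is an upper bound for the checksum stage in that report, since only the 2037105 remaining candidate files are hashed and their combined size isn't shown in the output.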
I added a -partialchecksum feature in my fork and also merged in @entrity's PR that adds the -progress option.