Stale file handle issues
Seen with:
$ duc --version
duc version: 1.4.5
options: cairo x11 ui tokyocabinet
Command line: duc index -d /tmp/redacted.duc /proj/redacted
General description of problem: While running duc in a UNIX/NFS environment with multiple computers using a Dell Isilon filer, I get into a situation where duc seems to lose track of the directory it is in. This problem reproduces only when my continuous integration server is running a regression and creating and deleting files from the path, but I am having trouble providing a testcase for easy reproduction for you. What I see in the duc output is something like:
Redacted log contents:
Cannot determine realpath of: relative_build_dir
skipping /proj/redacted/lots/of/different/paths/relative_build_dir: No such file or directory
Error statting other_relative_file: No such file or directory
[Repeats for several other files]
Error statting different_relative_file: Stale file handle
[Repeats for several other files]
Cannot determine realpath of: relative_dir
[Lots of different versions of the above errors for various different paths/files]
Error statting lots: Stale file handle
skipping /proj/redacted: Stale file handle
In my NFS environment, I have a lot of Continuous integration jobs running in the background which are deleting directories off the filer. This creates situations where whole directory trees will just disappear as the duc job runs. Not ideal, but also not much I can do about it.
My analysis, FWIW: To me, what it looks like is happening is this: duc keeps track of its current directory, the directory it is in gets deleted underneath it, it then tries to use relative paths to recover, and since the directory it is in is now gone, it loses track of where it was and is unable to scan any more of the drive because it is off in the weeds with a series of stale file handles. This is the best description of the problem I can come up with based on the data presented, without a deep dive into the duc source code.
"ech3" == ech3 @.***> writes:
Seen with: /tools/bin/duc --version duc version: 1.4.5 options: cairo x11 ui tokyocabinet
Can you try with the latest 1.5.0-rc1 tag and see if it is still doing this? I suspect so...
General description of problem:
While running duc in a UNIX/NFS environment with multiple computers using a Dell Isilon filer, I get into a situation where duc seems to lose track of the directory it is in. This problem reproduces only when my continuous integration server is running a regression and creating and deleting files from the path, but I am having trouble providing a testcase for easy reproduction for you. What I see in the duc output is something like:
So I think your analysis is pretty spot on, and duc needs to be a bit more tolerant of this situation. I might be able to whip up a test case where I have a bunch of deep directory trees with lots of leaves being added and deleted while duc is scanning through them, especially if I put in a slight delay between each loop over a directory entry. The idea being to delete as much of the tree as possible while duc is in it.
Or, if you get failures when you try to move up to your parent directory, just keep moving up, slicing that part of the tree off as you go.
So I'd probably want to do some research on how other tools handle this gracefully. I could see that if you get lost, or your CWD goes away from underneath you, then you need to just go back to the root and start over. And probably prune those entries if not found again.
But this is going to be a hard case to handle well. I'll have to spend some time thinking on this and looking at the code and coming up with a test case to see what needs to happen.
Redacted log contents:
Cannot determine realpath of: relative_build_dir
skipping /proj/redacted/lots/of/different/paths/relative_build_dir: No such file or directory
Error statting other_relative_file: No such file or directory
[Repeats for several other files]
Error statting different_relative_file: Stale file handle
[Repeats for several other files]
Cannot determine realpath of: relative_dir
[Lots of different versions of the above errors for various different paths/files]
Error statting lots: Stale file handle
skipping /proj/redacted: Stale file handle
Can you give me an idea (roughly) of how many files and directories you have on your NFS server in this tree? And how large it is?
Another possible thought would be to take a snapshot and then have duc run on the snapshot instead of the main directory tree, since you know that will be stable. We'd have to think of a way to strip the snapshot entries off the path, though.
I'm much more familiar with NetApps and their NFS setup, so you could search:
/path/to/dir/.snapshot/daily.0/
and then you'd need some way to just keep the /path/to/dir/ and get rid of the last two entries in the path.
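For illustration only, trimming everything from the ".snapshot" component onward back to the original mount path could be as simple as this hypothetical helper (not part of duc; the function name and approach are assumptions):

#include <string.h>

/* Hypothetical helper: given "/path/to/dir/.snapshot/daily.0", cut the
 * path back to "/path/to/dir" so reports refer to the live tree. */
static void strip_snapshot(char *path)
{
    char *p = strstr(path, "/.snapshot/");
    if (p != NULL)
        *p = '\0';
}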
In my NFS environment, I have a lot of Continuous integration jobs running in the background which are deleting directories off the filer. This creates situations where whole directory trees will just disappear as the duc job runs. Not ideal, but also not much I can do about it.
Yeah, it's a tough situation to handle.
My analysis FWIW
To me, what it looks like is happening is this: duc keeps track of its current directory, the directory it is in gets deleted underneath it, it then tries to use relative paths to recover, and since the directory it is in is now gone, it loses track of where it was and is unable to scan any more of the drive because it is off in the weeds with a series of stale file handles. This is the best description of the problem I can come up with based on the data presented, without a deep dive into the duc source code.
You've done a great job helping here! I suspect we need to do a full path lookup when a relative path lookup fails to figure out if it's been deleted or not.
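For illustration, a minimal sketch of that kind of fallback might look like this, assuming a scanner that stats entries relative to the current directory (the helper name and structure are hypothetical, not duc's actual code):

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: stat 'name' relative to the current directory first;
 * if that fails with ENOENT or ESTALE, retry with the absolute path built
 * from the directory we believe we are scanning ('dir_path'). */
static int stat_with_fallback(const char *dir_path, const char *name,
                              struct stat *st)
{
    if (lstat(name, st) == 0)
        return 0;

    if (errno != ENOENT && errno != ESTALE)
        return -1;

    char full[PATH_MAX];
    snprintf(full, sizeof(full), "%s/%s", dir_path, name);

    /* If this also fails, the entry (or its parent) really is gone and the
     * caller can skip it instead of giving up on the rest of the scan. */
    return lstat(full, st);
}

If the absolute-path retry succeeds, the scanner knows it was its working directory that went away rather than the entry itself.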
Gotta spend some time thinking on this, but thanks for the report!
Can you try with the latest 1.5.0-rc1 tag and see if it is still doing this? I suspect so...
Will do. This will just take a bit to get set up and run.
Can you give me an idea (roughly) of how many files and directories you have on your NFS server in this tree? And how large it is?
Reported from an Isilon tool:
Disk usage: 1.62 TB
Disk limit: 1.95 TB
Files Used: 6286952
$ df -h /proj/redacted
Filesystem                Size  Used Avail Use% Mounted on
server:/path/to/redacted  2.0T  1.7T  342G  83% /proj/redacted
$ duc info -d /proj/redacted/redacted.duc
Date       Time       Files    Dirs     Size  Path
2025-03-26 04:03:23     2.0M  253.1K  273.6G  /proj/redacted
I'm much more familiar with NetApps and their NFS setup, so you could search: /path/to/dir/.snapshot/daily.0/
We used to use NetApps, sigh, and we have a similar snapshot setup. I would just prefer to avoid that, since this is supposed to be giving an accounting of the "current" system usage. However, your suggestion is a better solution than anything I can think of. I will investigate what I can do on this front.
Gotta spend some time thinking on this, but thanks for the report!
Understood. I was hoping that the "solution" was that as soon as duc gets a stale file handle it would fall back from using a relative path to using an absolute one and see if that worked any better. However, my projects tend to have a lot of technical debt in them that I am still working off, and you may not want to band-aid it this way.
I did manage to get 1.5.0-rc1 to compile:
$ ./duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tokyocabinet
I use a helter-skelter mix of tools (autoconf 2.71 is what I used to generate the configure), and I was unable to get tkrzw to link properly. The configure didn't show TKRZW_LIBS/CFLAGS in the configure --help output. Not sure if this is something on my side of the fence or not. I am not root, so I can't install things system-wide and have to do a lot of duct-tape-type solutions to get things working. This is a release candidate, so I don't expect everything to be working. Back to the problem at hand:
A run was done with both the 1.4.5 and the 1.5.0-rc1 and neither reproduced the issue. Our filer performance is double what it was when the problem was happening yesterday. So maybe someone around here sacrificed more bits to the filer gods and I was not informed. /s
I will keep trying to see if I can reproduce the problem...
I tried your suggestion of using the snapshot directory. The problem is that the path gets injected into the CGI output/path and it makes things kind of ugly. So the path specified in the URL looks something like:
/proj/redacted/.snapshot/proj012_10pm_daily_03-26-2025_22:00
Maybe I can change the cgi wrapper I have to massage the path, but that will require some more investigation.
"ech3" == ech3 @.***> writes:
I did manage to get 1.5.0-rc1 to compile:
$ ./duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tokyocabinet
Excellent! But I see you had issues with tkrzw, oh well.
I use a helter-skelter mix of tools (autoconf 2.71 is what I used to generate the configure), and I was unable to get tkrzw to link properly. The configure didn't show TKRZW_LIBS/CFLAGS in the configure --help output. Not sure if this is something on my side of the fence or not. I am not root, so I can't install things system-wide and have to do a lot of duct-tape-type solutions to get things working. This is a release candidate, so I don't expect everything to be working. Back to the problem at hand:
If you send some details on how you compiled tkrzw and the errors you got, I might be able to help. But it should just be:
./configure --prefix /proj/tools
make
make install
assuming you have write access to the /proj/tools/ directory and can create files and directories in there. Or just use your home directory.
A run was done with both the 1.4.5 and the 1.5.0-rc1 and neither reproduced the issue. Our filer performance is double what it was when the problem was happening yesterday. So maybe someone around here sacrificed more bits to the filer gods and I was not informed. /s
Hah! Sounds like people were beating on the Isilon yesterday and now it has more performance room available. These are the worst problems to diagnose.
I will keep trying to see if I can reproduce the problem...
Thanks! I suspect if you just create a tree with a bunch of files, then start adding/deleting files and directories while doing a duc scan in another window, it might reproduce at some point. Make sure you create a deep tree, so you have a better chance of catching out duc.
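As a rough sketch of the kind of churn that could run in a second terminal while duc indexes the same area (the paths, depth, and timing here are arbitrary assumptions, not a tested reproducer):

/* churn.c: repeatedly create and delete a deep directory tree, to race
 * against a "duc index" running in another terminal.  Build with
 * "cc -o churn churn.c", run it inside the tree being indexed, and stop
 * it with Ctrl-C. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

#define DEPTH 8
#define FILES 4

int main(void)
{
    char dir[4096], file[4096];

    for (;;) {
        /* Create churn/d/d/.../d with a few small files at each level. */
        strcpy(dir, "churn");
        for (int d = 0; d <= DEPTH; d++) {
            mkdir(dir, 0755);
            for (int f = 0; f < FILES; f++) {
                snprintf(file, sizeof(file), "%s/f%d", dir, f);
                int fd = open(file, O_CREAT | O_WRONLY, 0644);
                if (fd >= 0) {
                    (void)write(fd, "x", 1);
                    close(fd);
                }
            }
            strcat(dir, "/d");
        }

        usleep(200000);   /* give the scanner a moment to descend */

        /* Remove everything again, deepest level first. */
        for (int d = DEPTH; d >= 0; d--) {
            strcpy(dir, "churn");
            for (int i = 0; i < d; i++)
                strcat(dir, "/d");
            for (int f = 0; f < FILES; f++) {
                snprintf(file, sizeof(file), "%s/f%d", dir, f);
                unlink(file);
            }
            rmdir(dir);
        }
    }
}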
I tried your suggestion of using the snapshot directory. The problem is that the path gets injected into the CGI output/path and it makes things kind of ugly. So the path specified in the URL looks something like:
/proj/redacted/.snapshot/proj012_10pm_daily_03-26-2025_22:00
Yeah, that's ugly.
Maybe I can change the cgi wrapper I have to massage the path, but that will require some more investigation.
That might be the simplest option here actually, to just add an option to the CGI script to strip out a string in the path when it's displayed. But that will probably break the CGI script unless it knows to put that back when doing the duc DB queries.
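Purely as an illustration of that two-way mapping (nothing here is duc's actual CGI code; the configured snapshot component and both helpers are made up):

#include <stdio.h>
#include <string.h>

/* Made-up configured string to hide from displayed paths. */
static const char *snap_component = "/.snapshot/daily.0";

/* Remove the snapshot component for display. */
static void to_display(const char *db_path, char *out, size_t len)
{
    const char *p = strstr(db_path, snap_component);
    if (p != NULL)
        snprintf(out, len, "%.*s%s", (int)(p - db_path), db_path,
                 p + strlen(snap_component));
    else
        snprintf(out, len, "%s", db_path);
}

/* Put it back before querying the duc database. */
static void to_db(const char *shown, const char *mount, char *out, size_t len)
{
    if (strncmp(shown, mount, strlen(mount)) == 0)
        snprintf(out, len, "%s%s%s", mount, snap_component,
                 shown + strlen(mount));
    else
        snprintf(out, len, "%s", shown);
}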
I'm not sure I'll have time this weekend to fix this, or even come up with a good solution. $WORK has gotten in my way today.
John
If you send some details on how you compiled tkrzw and the errors you got, I might be able to help. But it should just be:
Will do when I get a chance.
Hah! Sounds like people were beating on the Isilon yesterday and now it has more performance room available. These are the worst problems to diagnose.
Yeah, this is going to be nasty. I have tried to reproduce this several times, but each time I was unable to. I run this on a lot of filer mounts, and I am still seeing this on 1.4.5, but never in the same filer mount twice. I would rather not run my whole system with 1.5.0-rc1, but I am not sure I will be able to reproduce it short of that.
Thanks! I suspect if you just create a tree with a bunch of files, then start adding/deleting files and directories while doing a duc scan in another window, it might reproduce at some point. Make sure you create a deep tree, so you have a better chance of catching out duc.
My plan was to take the Linux kernel and delete a portion of the tree while duc was running, but that didn't seem to work when I did it in two terminal windows on the same machine. I think the combination of a slow filer sending deletion notifications back to the OS (I don't know NFS that well, so I am guessing) is causing it. I would like to get duc into a situation where I could pause it temporarily (Ctrl-Z) in a portion of the hierarchy, rug-pull the directory, then foreground it and see what happens. There's probably a way to get duc to print out exactly where it is in the scan, but I would need to dedicate some time to get that set up if I have to hack it in.
I'm not sure I'll have time this weekend to fix this, or even come up with a good solution. $WORK has gotten in my way today.
Yeah, I get it. I am not in a better position myself with $WORK. As always, I am thankful for anyone's time that they can dedicate to looking at this. duc is a really great product IMHO, and someone like me is only going to complain when I am actually using the product. Whether I am abusing the product is up for debate. Thanks again, and don't ruin your weekend on my account.
I believe I was able to reproduce the issue. I have also included information about the steps I took for the tkrzw issue if you want to see what I did.
Instructions for reproducing original issue
- Run the commands:
wget "https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.14.tar.xz"
tar xvf linux-6.14.tar.xz
- Run the following command and Ctrl-Z after about a second:
duc index --debug -v -d linux.duc linux-6.14
- The output will look something like:
>> devices
2923 24576 cma3000_d0x.rst
8321 24576 xpad.rst
22441 24576 elantech.rst
4100 24576 walkera0701.rst
^Z
[6]+ Stopped duc index --debug -v -d linux.duc linux-6.14
- Based on the output above, I chose the last file it mentioned (in this case walkera0701.rst), and I tried to find it in the Linux kernel's tree:
$ find linux-6.14/ -name walkera0701.rst
linux-6.14/Documentation/input/devices/walkera0701.rst
- Based on its position in the tree, I deleted a directory it was under:
rm -rf linux-6.14/Documentation/input/
- Foreground the duc process and watch as the problem happens:
$ fg
duc index --debug -v -d linux.duc linux-6.14
2102 24576 edt-ft5x06.rst
Error statting pxrc.rst: No such file or directory
Error statting sentelic.rst: No such file or directory
[SNIP]
Error statting security: No such file or directory
Error statting .gitattributes: No such file or directory
skipping /proj/redacted/linux-6.14: Stale file handle
<< /proj/redacted/linux-6.14 actual:44599780 apparent:264822784
Indexed 9467 files and 480 directories, (42.5MB apparent, 252.6MB actual) in 1 minutes, and 6.02 seconds.
- Restore the kernel area in question before re-running since you may not be in a subdirectory:
tar xvf linux-6.14.tar.xz linux-6.14/Documentation
or the safer:
tar xvf linux-6.14.tar.xz
I was able to reproduce the issue on both 1.4.5 and 1.5.0-rc1, but it took a couple of tries with 1.5.0-rc1 and I needed to delete a few directories up from where it was (for linux-6.14/drivers/net/ethernet/ethoc.c I deleted linux-6.14/drivers/*) in order for it to reproduce. I am not sure if there is a more reliable way to reproduce it than this, but it was the best I could come up with.
Instructions for tkrzw issue
- Run the following commands:
wget https://dbmx.net/tokyocabinet/tokyocabinet-1.4.48.tar.gz
tar xvf tokyocabinet-1.4.48.tar.gz
cd tokyocabinet-1.4.48
./configure
make -j2
git clone https://github.com/estraier/tkrzw.git
cd tkrzw
./configure
make -j2
wget https://github.com/zevv/duc/archive/refs/tags/v1.5.0-rc1.tar.gz
tar xvf v1.5.0-rc1.tar.gz
cd duc-1.5.0-rc1
# There's no configure in the tar.gz, so I have to build it:
aclocal
autoreconf -i
- A default config results in:
$ export PKG_CONFIG_PATH="../tkzrw:../tokyocabinet-1.4.48:/usr/lib/pkgconfig"
$ ./configure
[SNIP]
Selected backend tkrzw
checking for tkrzw_get_last_status in -ltkrzw... no
configure: error: Unable to find tkrzw
- If I try to use tokyocabinet:
./configure --with-db-backend=tokyocabinet
Everything works correctly. The PKG_CONFIG_PATH setting shown above is still needed, though; otherwise it won't compile.
- The TKRZW doesn't seem to have the appropriate variables that Tokyo Cabinet does:
$ ./configure --help | grep TK
$ ./configure --help | grep TC
TC_CFLAGS C compiler flags for TC, overriding pkg-config
TC_LIBS linker flags for TC, overriding pkg-config
$ cat /etc/issue
Red Hat Enterprise Linux release 8.10 (Ootpa)
Since there isn't a configure in the release tarball, this may be something to do with my automake toolchain.
Thanks for the reproduction info, this will help. Haven't touched this yet, been busy and now I've got a cold. Ugh.
I suspect the real answer might be to fail gracefully back upwards, deleting all the data we've gotten so far, until we get back to where we're able to continue. Since it's a recursive function, that would seem to be the easiest, as long as we can pass back a proper error code for this situation.
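As a very rough sketch of that shape, assuming a simplified recursive scan over full paths rather than whatever duc actually does internally (all names here are hypothetical):

#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Illustrative return codes for a recursive scanner (not duc's API). */
enum scan_result { SCAN_OK, SCAN_SKIPPED, SCAN_GONE };

static enum scan_result scan(const char *path, long long *size)
{
    DIR *d = opendir(path);
    if (d == NULL)
        return (errno == ESTALE || errno == ENOENT) ? SCAN_GONE : SCAN_SKIPPED;

    long long total = 0;
    struct dirent *e;

    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        char child[4096];
        snprintf(child, sizeof(child), "%s/%s", path, e->d_name);

        struct stat st;
        if (lstat(child, &st) != 0)
            continue;   /* entry vanished or went stale: skip it, keep going */

        if (S_ISDIR(st.st_mode)) {
            long long sub = 0;
            if (scan(child, &sub) == SCAN_OK)
                total += sub;
            /* SCAN_GONE / SCAN_SKIPPED: the subtree's partial data is
             * simply not added, and we move on to its siblings. */
        } else {
            total += st.st_size;
        }
    }
    closedir(d);

    *size += total;
    return SCAN_OK;
}

The key point is just that a vanished or stale subtree reports a distinct code, its partial totals get dropped, and the caller carries on with the remaining siblings instead of unwinding the whole scan.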
Need to think more on this. John
So I've totally spaced on looking into this issue, but some other stuff came up and I'm getting ready to push out 1.5.0-rc2 with some other fixes and updates. So I don't have a solution to your 'duc loses its mind when you delete directory trees underneath it while running' issue, but I'm also sure the tkrzw issue is fixed if you just install it after you compile it, before you try to configure and compile duc.
Can you open a separate issue for this please so I can track it? It might be simpler overall to just compile tkrzw directly into duc. Not sure.
Can you open a separate issue for this please so I can track it? It might be simpler overall to just compile tkrzw directly into duc. Not sure.
I have created issue #341 about the tkrzw issue.
So I've totally spaced on looking into this issue
No worries. If it was bad enough I would propose a fix, but I am not in that place currently.