FilePlugin suddenly crashes on Linux
I have now been observing this multiple times in my self-built squeak.cog.spur VM (767386b5c) on Ubuntu 18: Suddenly, the sources files are closed, so I can browse decompiled methods in the image, and opening every new method in a browser takes 5 seconds because the system apparently still tries to open the system. Smalltalk closeSourceFiles; openSourceFiles does not fail, but also does not help. FileList is empty, primitiveDirectoryLookup (for FileDirectory default entries) fails without an error code. No output in the console. I have enough free RAM and there are no suspicious processes in my image (just some webserver/client stuff).
Does anybody have seen this issue before or has an idea how to fix this?
I would suspect running out of file descriptors. There is an assumption in the image that file descriptors can be closed lazily when an attempt to open a file fails. I don’t think the same effort to close all old files is made when the directory primitives (which at least on Unix require a file descriptor) fail.
In general we should be closing old files descriptors using ephemerons and incremental old space collection. I (we?) urgently need to productize Tom Braun’s work on the incremental collector.
Also, CurrentReadOnlySourceFiles is an egregious bug. The idea here (see squeak-dev discussion between Vanessa and me) is that using a separate file for each method source access insulates eg the debugger from using a single pair of readonly sources and changes files. The issue to be avoided is that when someone is debugging/exploring source fetching for a given method, if there is no copy then as the debugger fetches source for the methods used by the file system so the same file used to fetch the source for the method the explorer chose is repositioned and the debugging/exploration is invalidated. An inverse heisenbug.
But instead of creating a separate copy for each method source retrieval, chewing through file descriptors as if there was no tomorrow, the debugger installed a new copy around its own method source retrieval, we could share a single copy of the source files, and get rid of the truly horrible CurrentReadOnlySourceFiles cacheDuring: crap.
This does not sound like a case of running out of file descriptors, but that can be verified by watching the /proc file system. Here is one way to monitor file handle leaks while you are working on something else:
proc := [ [ Transcript showln: DateAndTime now asString, ' Number of open VM file descriptors: ', (OSProcess outputOf: 'ls /proc/' , OSProcess thisOSProcess pid asString, '/fd | wc -l ') asInteger asString. (Delay forSeconds: 3) wait ] repeat ] fork.
proc terminate.
I have been observing this issue more frequently recently while working witha lot of open network sockets. The source files fail as soon as there are exactly 1024 open handles.
Relatively young image:
christoph@tr-medion:~$ ls -lisa /proc/2279917/fd
total 0
25852091 0 dr-x------ 2 christoph christoph 0 Sep 19 14:40 .
25688342 0 dr-xr-xr-x 9 christoph christoph 0 Sep 19 04:54 ..
25852092 0 lrwx------ 1 christoph christoph 64 Sep 19 14:40 0 -> /dev/pts/2
25852093 0 lrwx------ 1 christoph christoph 64 Sep 19 14:40 1 -> /dev/pts/2
25870682 0 l-wx------ 1 christoph christoph 64 Sep 19 14:59 10 -> 'pipe:[25862432]'
25852099 0 lr-x------ 1 christoph christoph 64 Sep 19 14:40 11 -> 'pipe:[25862433]'
25852100 0 lr-x------ 1 christoph christoph 64 Sep 19 14:40 12 -> 'pipe:[25657693]'
25870683 0 lrwx------ 1 christoph christoph 64 Sep 19 14:59 13 -> 'socket:[25863033]'
25852101 0 lr-x------ 1 christoph christoph 64 Sep 19 14:40 14 -> 'pipe:[25862434]'
25852094 0 lrwx------ 1 christoph christoph 64 Sep 19 14:40 2 -> /dev/pts/2
25852095 0 lrwx------ 1 christoph christoph 64 Sep 19 14:40 3 -> 'anon_inode:[eventpoll]'
25852096 0 lrwx------ 1 christoph christoph 64 Sep 19 14:40 4 -> 'socket:[25655475]'
25852097 0 lr-x------ 1 christoph christoph 64 Sep 19 14:40 5 -> /home/christoph/git/OpenSmalltalk/opensmalltalk-vm/image/SqueakV60.sources
25852098 0 lrwx------ 1 christoph christoph 64 Sep 19 14:40 6 -> /home/christoph/git/OpenSmalltalk/opensmalltalk-vm/image/Spur64VMMaker.changes
25870680 0 lrwx------ 1 christoph christoph 64 Sep 19 14:59 7 -> 'socket:[25860673]'
25870681 0 lrwx------ 1 christoph christoph 64 Sep 19 14:59 9 -> 'socket:[25863551]'
When it fails:
christoph@tr-medion:~$ ls -lisa /proc/2279917/fd
total 0
25951580 0 dr-x------ 2 christoph christoph 0 Sep 19 16:41 .
25688342 0 dr-xr-xr-x 9 christoph christoph 0 Sep 19 04:54 ..
25951585 0 lrwx------ 1 christoph christoph 64 Sep 19 16:41 0 -> /dev/pts/2
25951586 0 lrwx------ 1 christoph christoph 64 Sep 19 16:41 1 -> /dev/pts/2
25951595 0 l-wx------ 1 christoph christoph 64 Sep 19 16:41 10 -> 'pipe:[25862432]'
25951685 0 lr-x------ 1 christoph christoph 64 Sep 19 16:41 100 -> /home/christoph/git/OpenSmalltalk/opensmalltalk-vm/image/Spur64VMMaker.changes
25952585 0 lr-x------ 1 christoph christoph 64 Sep 19 16:41 1000 -> /home/christoph/git/OpenSmalltalk/opensmalltalk-vm/image/Spur64VMMaker.changes
--- truncated 1017 more entries for both change files ---
25952583 0 lr-x------ 1 christoph christoph 64 Sep 19 16:41 998 -> /home/christoph/git/OpenSmalltalk/opensmalltalk-vm/image/Spur64VMMaker.changes
25952584 0 lr-x------ 1 christoph christoph 64 Sep 19 16:41 999 -> /home/christoph/git/OpenSmalltalk/opensmalltalk-vm/image/SqueakV60.sources
Any ideas how to proceed?
a) Use ephemerons to finalize files, not copies. This should allow files to be flushed, closed, and reopened when file descriptors are exhausted. Implement a last used timestamp (microsecond clock) in files so that when descriptors are exhausted the least recently used are the ones closed. When the primitives that answer new file descriptors fail a GC is run and the primitive repeated. If the repeat fails then the least recently used Nth of files are temporarily closed.
b) implement my scheme for getting rid of CurrentReadOnlySourceFiles. This scheme exists only to arrange that the debugger does not interfere with source file access when debugging source file access. By arranging that each method source access is done with a new read-only source file the debugger cannot interfere with a source file access as it fetches the source of methods involved in source file access. But this is incredibly profligate with file descriptors and requires the horribly ugly CurrentReadOnlySourceFiles hack. Instead it should be the debugger (and AFAICT only the debugger) that should install and remove via ensure: a copy of the source file around each source access it makes, hence insulating the code being debugged from using the same file as the rest of the system.
To allow the debugger to debug the debugger, and to reduce the number of file descriptors created, each instance of the debugger should have its own copies of the two source files, which it installs and removes around each source access. In this way in a normal system we only have as many source files pairs (source & changes) as one plus the number of debuggers.
CurrentReadOnlySourceFiles can be kept around as a noop for a few months before being deprecated.
The CurrentReadOnlySourceFiles issue needs to be addressed
So independently of the recommended optimizations on the image side, you don't consider this a bug in the VM?
So independently of the recommended optimizations on the image side, you don't consider this a bug in the VM?
Since there's nothing the VM can do about the image asking for more file descriptors than the OS can provide, then no; instead it is a bug above the VM.
So independently of the recommended optimizations on the image side, you don't consider this a bug in the VM?
Since there's nothing the VM can do about the image asking for more file descriptors than the OS can provide, then no; instead it is a bug above the VM.
We could call setrlimit with something much higher than 1024, provided that no small hard limit is set up globally.
I have no way to know if this issue #732 is related to issue #684 (https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/684) but I can say for certain that there is a problem with socket handle leaks in certain network/server configurations on certain versions of the unix VM (presumably including the latest), that it is very hard to reproduce, that it is definitely a VM problem, and that anything we do on the image side to "fix" it would presumably only serve to further mask the root cause of the problem.
FYIO today I experienced the same issue with an older VM (202205110711/VMMaker.oscog-mt.318, shame on me for not having replaced it a newer one), so this is at least not a recent regression.
Just saw this again with a VM from 202312181441/VMMaker.oscog-eem.3339 and a trunk image updated somewhere in July. Maybe that helps us narrow it down. Similar pattern as before: Ubuntu 22 (different machine), 1081 open file handles, thereof ~47% SqueakV50.sources, ~47% my changes file. Plus a bunch of .so files and application-specific log files and TCP ports. StandardFileStream registry listed the changes and sources files just once.