clamav
clamd becomes zombie process after some time
Hey folks,
In testing this image out for an application, I'm noticing that clamd just stops after some time and ps lists it as a zombie:
clamav-6565b79d54-ljft4:/# ps
PID USER TIME COMMAND
1 root 0:00 {init} /sbin/tini /bin/sh /init
6 root 0:01 tail -f /dev/null
15 root 0:45 [clamd]
88 clamav 0:01 freshclam --checks=1 --daemon --foreground --stdout --user=clamav
5176 root 0:00 /bin/sh -l
5194 root 0:00 ps
I noticed this when a container would randomly stop responding after some time. There was a failed update before the last error state we hit, so perhaps that has something to do with it. Or perhaps there is an issue with not using wait
in the init script? See the logs below:
Updating initial database
Fri Oct 15 04:40:03 2021 -> *Current working dir is /var/lib/clamav/
Fri Oct 15 04:40:03 2021 -> *Loaded freshclam.dat:
Fri Oct 15 04:40:03 2021 -> * version: 1
Fri Oct 15 04:40:03 2021 -> * uuid: 93adae7f-7a75-4868-a391-dc7a8556d7e6
Fri Oct 15 04:40:03 2021 -> ClamAV update process started at Fri Oct 15 04:40:03 2021
Fri Oct 15 04:40:03 2021 -> *Current working dir is /var/lib/clamav/
Fri Oct 15 04:40:03 2021 -> *Querying current.cvd.clamav.net
Fri Oct 15 04:40:03 2021 -> *TTL: 1800
Fri Oct 15 04:40:03 2021 -> *fc_dns_query_update_info: Software version from DNS: 0.103.3
Fri Oct 15 04:40:03 2021 -> *Current working dir is /var/lib/clamav/
Fri Oct 15 04:40:03 2021 -> *check_for_new_database_version: Local copy of daily found: daily.cld.
Fri Oct 15 04:40:03 2021 -> *query_remote_database_version: daily.cvd version from DNS: 26322
Fri Oct 15 04:40:03 2021 -> daily database available for update (local version: 26321, remote version: 26322)
Fri Oct 15 04:40:05 2021 -> *Retrieving https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> *downloadFile: Download source: https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> *downloadFile: Download destination: ./clamav-ec00de4a4db750947e6453c27a3bf3c4.tmp
* Trying 104.16.218.84:443...
* connect to 104.16.218.84 port 443 failed: Connection refused
* Trying 104.16.219.84:443...
* connect to 104.16.219.84 port 443 failed: Connection refused
* Trying 2606:4700::6810:da54:443...
* Immediate connect fail for 2606:4700::6810:da54: Address not available
* Trying 2606:4700::6810:db54:443...
* Immediate connect fail for 2606:4700::6810:db54: Address not available
* Failed to connect to database.clamav.net port 443 after 97 ms: Connection refused
* Closing connection 0
Fri Oct 15 04:40:05 2021 -> ^Download failed (7)
Fri Oct 15 04:40:05 2021 -> ^ Message: Couldn't connect to server
Fri Oct 15 04:40:05 2021 -> ^downloadPatch: Can't download daily-26322.cdiff from https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> *Retrieving https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> *downloadFile: Download source: https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> *downloadFile: Download destination: ./clamav-3a6021abe05617356df92e020c478e27.tmp
* Trying 104.16.218.84:443...
* connect to 104.16.218.84 port 443 failed: Connection refused
* Trying 104.16.219.84:443...
* connect to 104.16.219.84 port 443 failed: Connection refused
* Trying 2606:4700::6810:da54:443...
* Immediate connect fail for 2606:4700::6810:da54: Address not available
* Trying 2606:4700::6810:db54:443...
* Immediate connect fail for 2606:4700::6810:db54: Address not available
* Failed to connect to database.clamav.net port 443 after 6 ms: Connection refused
* Closing connection 0
Fri Oct 15 04:40:05 2021 -> ^Download failed (7)
Fri Oct 15 04:40:05 2021 -> ^ Message: Couldn't connect to server
Fri Oct 15 04:40:05 2021 -> ^downloadPatch: Can't download daily-26322.cdiff from https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> *Retrieving https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> *downloadFile: Download source: https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> *downloadFile: Download destination: ./clamav-40d489cd516e734f7ad5a69147e798ac.tmp
* Trying 104.16.219.84:443...
* connect to 104.16.219.84 port 443 failed: Connection refused
* Trying 104.16.218.84:443...
* connect to 104.16.218.84 port 443 failed: Connection refused
* Trying 2606:4700::6810:db54:443...
* Immediate connect fail for 2606:4700::6810:db54: Address not available
* Trying 2606:4700::6810:da54:443...
* Immediate connect fail for 2606:4700::6810:da54: Address not available
* Failed to connect to database.clamav.net port 443 after 14 ms: Connection refused
* Closing connection 0
Fri Oct 15 04:40:05 2021 -> ^Download failed (7)
Fri Oct 15 04:40:05 2021 -> ^ Message: Couldn't connect to server
Fri Oct 15 04:40:05 2021 -> ^downloadPatch: Can't download daily-26322.cdiff from https://database.clamav.net/daily-26322.cdiff
Fri Oct 15 04:40:05 2021 -> The database server doesn't have the latest patch for the daily database (version 26322). The server will likely have updated if you check again in a few hours.
Fri Oct 15 04:40:05 2021 -> *fc_update_database: daily.cld already up-to-date.
Fri Oct 15 04:40:05 2021 -> *Current working dir is /var/lib/clamav/
Fri Oct 15 04:40:05 2021 -> *check_for_new_database_version: Local copy of main found: main.cld.
Fri Oct 15 04:40:05 2021 -> *query_remote_database_version: main.cvd version from DNS: 62
Fri Oct 15 04:40:05 2021 -> main.cld database is up-to-date (version: 62, sigs: 6647427, f-level: 90, builder: sigmgr)
Fri Oct 15 04:40:05 2021 -> *fc_update_database: main.cld already up-to-date.
Fri Oct 15 04:40:05 2021 -> *Current working dir is /var/lib/clamav/
Fri Oct 15 04:40:05 2021 -> *check_for_new_database_version: Local copy of bytecode found: bytecode.cvd.
Fri Oct 15 04:40:05 2021 -> *query_remote_database_version: bytecode.cvd version from DNS: 333
Fri Oct 15 04:40:05 2021 -> bytecode.cvd database is up-to-date (version: 333, sigs: 92, f-level: 63, builder: awillia2)
Fri Oct 15 04:40:05 2021 -> *fc_update_database: bytecode.cvd already up-to-date.
The same here in a GKE environment.
Also running into this issue in EKS.
The same happens to me:
- just after freshclam notifies clamd about the update
- just after a clamscan command runs successfully

Both happen in the container.
My current solution for the first case is building another image based on clamav/clamav and running freshclam as a standalone command in the build process. (I also deploy this image daily to pick up the latest updates.)
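Roughly, the build looks like this (a minimal sketch, not my exact setup; the base tag, freshclam flags, and the target image name are assumptions):

```sh
# Hypothetical build script for the derived image: write a tiny Dockerfile
# that runs freshclam once at build time so the image ships with databases
# that are already up to date, then build and tag it (rebuild daily).
cat > Dockerfile <<'EOF'
FROM clamav/clamav:latest
RUN freshclam --foreground --stdout --user=clamav
EOF

docker build -t my-registry/clamav-with-db:latest .
```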
I experienced the same in a Kubernetes cluster.
In my case the clamd process ran out of memory and was killed by Kubernetes. I was able to solve it by increasing the memory limits in the Kubernetes deployment.
To debug this, I started the pod with CLAMAV_NO_FRESHCLAMD="true" and CLAMAV_NO_CLAMD="true", started clamd manually in a shell, and once it was up I sent the "RELOAD" command via netcat to trigger a database reload.
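For reference, triggering the reload looked roughly like this (shown here against the TCP socket on port 3310 that the image's clamd.conf enables; host and port are assumptions):

```sh
# Ask the running clamd to reload its databases; it should answer "RELOADING".
echo RELOAD | nc 127.0.0.1 3310
```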
The relevant strace part:
poll([{fd=6, events=POLLIN}], 1, 600000) = 1 ([{fd=6, revents=POLLIN}])
read(6, "\0", 1025) = 1
poll([{fd=6, events=POLLIN}, {fd=10, events=POLLIN}], 2, 30000) = 1 ([{fd=10, revents=POLLIN}])
recvmsg(10, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="RELOAD\0", iov_len=4104}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 7
sendto(10, "RELOADING\n", 10, 0, NULL, 0) = 10
shutdown(10, SHUT_RDWR) = 0
close(10) = 0
open("/var/lib/clamav", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 10
fcntl(10, F_SETFD, FD_CLOEXEC) = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efe88e02000
getdents64(10, 0x7efe88e020a8 /* 6 entries */, 2048) = 184
stat("/var/lib/clamav/bytecode.cvd", {st_mode=S_IFREG|0644, st_size=293670, ...}) = 0
stat("/var/lib/clamav/main.cvd", {st_mode=S_IFREG|0644, st_size=170479789, ...}) = 0
stat("/var/lib/clamav/daily.cld", {st_mode=S_IFREG|0644, st_size=181806592, ...}) = 0
getdents64(10, 0x7efe88e020a8 /* 0 entries */, 2048) = 0
close(10) = 0
munmap(0x7efe88e02000, 12288) = 0
mmap(NULL, 143360, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efe88d8b000
mprotect(0x7efe88d8d000, 135168, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], ~[HUP INT ILL BUS FPE KILL SEGV USR2 PIPE TERM CONT STOP TSTP RTMIN RT_1 RT_2], 8) = 0
clone(child_stack=0x7efe88dadab8, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000, parent_tid=[45], tls=0x7efe88dadb38, child_tidptr=0x7efed60f2f90) = 45
rt_sigprocmask(SIG_SETMASK, ~[HUP INT ILL BUS FPE KILL SEGV USR2 PIPE TERM CONT STOP TSTP RTMIN RT_1 RT_2], NULL, 8) = 0
poll([{fd=6, events=POLLIN}], 1, 600000Thu Feb 17 12:33:10 2022 -> Reading databases from /var/lib/clamav
<unfinished ...>) = ?
+++ killed by SIGKILL +++
It shows that the process received a SIGKILL signal (because it exceeded its memory limits, in my case).
But there seems to be a bug in the Docker image here: in an OOM situation, the container should be killed (and restarted). Instead, the clamd process becomes a zombie and the container keeps running in a defunct state.
I don't know tini, but it might be at fault for not reaping the process properly and leaving a zombie instead. IMO this should be fixed, as OOM or other reasons might lead to the clamd process being killed.
In the meantime, I configured a livenessProbe in Kubernetes using the clamdcheck.sh script (the same one the Docker health check command uses).
@eht16 maybe the health check issue is related to https://github.com/Cisco-Talos/clamav/pull/380 and https://github.com/Cisco-Talos/clamav/pull/369?
@jacobrayschwartz @denis111 @ysmall-backbase: Per the comment above, it may just be that you need to allocate more RAM for your containers.
Either that or disable the ConcurrentDatabaseReload feature in clamd.conf so that it doesn't load two engines (2x as much RAM used) during the reload: https://github.com/Cisco-Talos/clamav/blob/main/etc/clamd.conf.sample#L202-L210
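If you don't want to rebuild the image, something along these lines in a derived image or an entrypoint wrapper might work (this mirrors the sed pattern from our Dockerfile; the installed config path is an assumption on my part):

```sh
# Uncomment/override ConcurrentDatabaseReload in the installed clamd.conf
# (path assumed); clamd will then block during reloads instead of loading
# a second copy of the databases into memory.
sed -i 's|^\#\(ConcurrentDatabaseReload\).*|\1 no|' /etc/clamav/clamd.conf
```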
> @eht16 maybe the health check issue is related to #380 and #369?
@micahsnyder there is no health check issue for me, it works fine :D. Kubernetes does not support Docker health checks, so one has to define their own in Kubernetes, but that is OK. For reference, the health check I'm using is:
livenessProbe:
  exec:
    command:
      - clamdcheck.sh
  initialDelaySeconds: 120
  failureThreshold: 2
  periodSeconds: 30
But I still see an issue here: if the clamd process is killed for whatever reason, the whole container should die. The current behavior, where the clamd process is not properly reaped and becomes a zombie instead, is wrong, IMO.
Health checks help to detect and handle the situation, but in this case they just hide the underlying root cause: the container becomes unusable.
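To illustrate what I mean, the end of the init script could look something like this instead of the plain tail (just a sketch of the idea, not a tested patch; it ignores the freshclam daemon for brevity):

```sh
# Sketch: block on the backgrounded clamd instead of `tail -f /dev/null`,
# so the container exits (and can be restarted) whenever clamd goes away.
clamd --foreground &
clamd_pid=$!

# wait returns once clamd terminates; propagate its exit status so the
# container's exit code reflects why clamd died.
wait "$clamd_pid"
exit "$?"
```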
I've seen a similar issue at startup time on AKS when the virus database is out of date (>7 days old). What seems to happen on a macro level is:
- the freshclam invocation in /init doesn't get triggered because the signature directory is not empty (specifically, /var/lib/clamav/main.cvd exists)
- freshclam gets run as a shell background job and starts fetching the update
- clamd gets started as a shell background job before freshclam has finished the update
- clamd gets upset by the out-of-date database (at least it looks like that, I've not looked with strace or similar yet) and exits
- freshclam eventually finishes but has no effect because clamd is already dead
- because /init started clamd as a shell background job but never checks whether clamd has exited, the failed clamd never gets reaped nor restarted

If I don't use a volume for /var/lib/clamav I reliably get the same issue on every start, because the signatures in the current 0.104 image are older than 7 days.
#563 I fixed it for myself by letting freshclam finish fetching updates before starting clamd. This only applies if CLAMAV_NO_FRESHCLAMD is set to false.
> #563 I fixed it for myself by letting freshclam finish fetching updates before starting clamd. This only applies if CLAMAV_NO_FRESHCLAMD is set to false.
I'm not sure if this is the right solution. What this bug sounds like to me is that ClamD is becoming a zombie process when it runs out of memory because both it and Freshclam have loaded the databases. I don't know why ClamD is becoming a zombie instead of being killed, or why the container isn't killed.
@andremae's comment about the 0.104 image's databases being older than 7 days is a decent reason to merge @TairaSayo's PR anyways, so it can update before ClamD starts -- but running with old databases wouldn't cause ClamD to crash or give up or anything. It's fine to load older databases. ClamD would just reload later when the self-check happens and it sees that Freshclam updated the databases.
One reason I'm hesitant to merge the PR is that ClamD already takes a while to start up. Having it wait for a hard-coded 60 seconds before starting ClamD would mean a much longer start-up time. Maybe we could have it run freshclam twice, instead:
- The first time without --daemon mode and without backgrounding it.
- The second time the same way it is run before the PR (daemonized and backgrounded).
This way ClamD will only wait as long as it takes for freshclam to finish updating. The second daemonized freshclam will do the DNS query to check for an update and will find it is up-to-date and then it will sleep until it's time to check again. I think that's fine.
But even with this change, I think we'll still run into this bug later when freshclam does an update while clamd is running and clamd reloads. There are two options to lower the RAM usage to resolve this:
- Disable database load-testing in freshclam.conf.
- Disable concurrent database reload in clamd.conf.
The second one is more intrusive because it means ClamD will block for 20-60 seconds once a day (depending on how fast the host is) while it reloads. But the concurrent reload uses more RAM than freshclam's database load-testing, so disabling it is arguably more important if you don't have a lot of RAM for your containers.
Thoughts?
@micahsnyder you can control ClamD delay with the CLAMD_STARTUP_DELAY variable, but I agree that it could be easily missed. Do you think it is better to lower the default value to 6-10 seconds?
@TairaSayo sorry about the lag replying to you. Instead of having a startup delay, what I'm proposing is this change, so you run freshclam twice. The first time is non-daemonized and foreground (blocking). The second time is like normal.
if [ "${CLAMAV_NO_FRESHCLAMD:-false}" != "true" ]; then
echo "Updating databases before starting ClamD"
freshclam \
--checks="${FRESHCLAM_CHECKS:-1}" \
--foreground \
--stdout \
--user="clamav"
fi
if [ "${CLAMAV_NO_CLAMD:-false}" != "true" ]; then
echo "Starting ClamD"
if [ -S "/run/clamav/clamd.sock" ]; then
unlink "/run/clamav/clamd.sock"
fi
clamd --foreground &
while [ ! -S "/run/clamav/clamd.sock" ]; do
if [ "${_timeout:=0}" -gt "${CLAMD_STARTUP_TIMEOUT:=1800}" ]; then
echo
echo "Failed to start clamd"
exit 1
fi
printf "\r%s" "Socket for clamd not found yet, retrying (${_timeout}/${CLAMD_STARTUP_TIMEOUT}) ..."
sleep 1
_timeout="$((_timeout + 1))"
done
echo "socket found, clamd started."
fi
if [ "${CLAMAV_NO_FRESHCLAMD:-false}" != "true" ]; then
echo "Starting FreshclamD to check for additional updates in the background"
freshclam \
--checks="${FRESHCLAM_CHECKS:-1}" \
--daemon \
--foreground \
--stdout \
--user="clamav" \
&
fi
I'm also proposing changing the freshclam.conf to have this:
TestDatabases no
so that freshclam doesn't use a bunch of RAM load-testing the databases after download.
And changing the clamd.conf to have this:
ConcurrentDatabaseReload no
so that ClamD doesn't have two databases loaded in memory at the same time during a reload -- which I suspect is what is causing the zombie clamd-process issue.
You will have to modify this area of the Dockerfile to add these options: https://github.com/Cisco-Talos/clamav/blob/main/Dockerfile#L56-L71
Something like this:
sed -e "s|^\(Example\)|\# \1|" \
-e "s|.*\(PidFile\) .*|\1 /run/lock/clamd.pid|" \
-e "s|.*\(LocalSocket\) .*|\1 /run/clamav/clamd.sock|" \
-e "s|.*\(TCPSocket\) .*|\1 3310|" \
-e "s|.*\(TCPAddr\) .*|#\1 0.0.0.0|" \
-e "s|.*\(User\) .*|\1 clamav|" \
-e "s|^\#\(LogFile\) .*|\1 /var/log/clamav/clamd.log|" \
-e "s|^\#\(LogTime\).*|\1 yes|" \
-e "s|^\#\(ConcurrentDatabaseReload\).*|\1 no|" \
"/clamav/etc/clamav/clamd.conf.sample" > "/clamav/etc/clamav/clamd.conf" && \
sed -e "s|^\(Example\)|\# \1|" \
-e "s|.*\(PidFile\) .*|\1 /run/lock/freshclam.pid|" \
-e "s|.*\(DatabaseOwner\) .*|\1 clamav|" \
-e "s|^\#\(UpdateLogFile\) .*|\1 /var/log/clamav/freshclam.log|" \
-e "s|^\#\(NotifyClamd\).*|\1 /etc/clamav/clamd.conf|" \
-e "s|^\#\(ScriptedUpdates\).*|\1 yes|" \
-e "s|^\#\(TestDatabases\).*|\1 no|" \
"/clamav/etc/clamav/freshclam.conf.sample" > "/clamav/etc/clamav/freshclam.conf" && \
What do you think?
I think that instead of executing tail -f "/dev/null" as the main running process we should start clamav.
That way, if it crashes, the container will restart and will not become a zombie container.
So the scenario will be like this (rough sketch below):
- run freshclam once to update the database
- run freshclam as a daemon
- run clamav in the foreground
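A rough sketch of what I mean (the freshclam flags are copied from the current init; using exec for clamd is my assumption):

```sh
#!/bin/sh
set -e

# 1) Update the databases once, in the foreground, before anything starts.
freshclam --foreground --stdout --user=clamav

# 2) Keep checking for updates in the background.
freshclam --checks="${FRESHCLAM_CHECKS:-1}" --daemon --foreground --stdout --user=clamav &

# 3) Run clamd as the main process: if it dies, the container dies with it
#    and the orchestrator can restart it instead of leaving a zombie behind.
exec clamd --foreground
```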
@victorwedo I think I agree with you. It's still possible the freshclam daemon might die and no one would be the wiser. But that's the case right now anyways. So, what you suggest seems like an improvement to me.
Anyone have any thoughts on why we shouldn't do this? @oliv3r any thoughts?
So I've had this container running for months, but without any memory restrictions on my containers :) I get, however, that that is a serious need for certain setups.
First, I just want to re-state what @micahsnyder wrote:
> But even with this change, I think we'll still run into this bug later when freshclam does an update while clamd is running and clamd reloads.
Btw, instead of tail, we could also just start a process monitor, right? 'Is the PID of freshclamd still OK; is the PID of clamd still OK?' and exit if not. That's just some shell magic instead of a tail :) The tail only exists to keep the container running and is ugly in itself anyhow ...
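Something like this, purely as a sketch (the pid-file paths are the ones the Dockerfile configures, the zombie check via /proc is an assumption, and it assumes both daemons are enabled):

```sh
# Sketch of a monitor that could replace `tail -f /dev/null` at the end of
# /init: exit as soon as either daemon is gone or has turned into a zombie.
is_healthy() {
    pid="$(cat "$1" 2>/dev/null)" || return 1
    [ -n "$pid" ] || return 1
    # Field 3 of /proc/<pid>/stat is the process state; "Z" means zombie.
    state="$(awk '{print $3}' "/proc/$pid/stat" 2>/dev/null)" || return 1
    [ -n "$state" ] && [ "$state" != "Z" ]
}

while true; do
    is_healthy /run/lock/clamd.pid || { echo "clamd is gone, exiting"; exit 1; }
    is_healthy /run/lock/freshclam.pid || { echo "freshclam is gone, exiting"; exit 1; }
    sleep 5
done
```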
Lack of RAM will always be an issue, right?
I think the alternative (at double the potential RAM cost, since freshclam also needs a lot of RAM) is to launch 2 containers, one managing freshclamd and one managing clamd; this is supported via the variables to cover both use cases. That is the main reason we have tail at the end of the container instead of clamd. I think breaking that way of working is less ideal, as it also breaks the '5 clamd containers with 1 freshclamd container' setup: you couldn't have a pure freshclam container anymore.
I think @eht16 makes a sensible statement:
> I don't know tini, but it might be at fault for not reaping the process properly and leaving a zombie instead. IMO this should be fixed, as OOM or other reasons might lead to the clamd process being killed.
The whole purpose of an init like tini is to do exactly this, isn't it? Reap processes. We could see if dumb-init actually behaves properly?
As for 'start freshclam first; wait for it to finish; then start clamd and freshclamd again' -- that is really just duct tape on the issue, isn't it? Why should clamd not function when the database is out of date? Btw, don't we do this anyway if we don't have any database (as clamd can't function without one)?
While I think it's good to make this more robust (making sure init catches the issues), I also think we should solve the problem at the core as well (avoid zombies to begin with, which, granted, is a longer-term goal of course).