aws-s3-virusscan

ClamAV stopped scanning files

Open nidhigwari opened this issue 3 years ago • 11 comments

We use ClamAV to scan files for our application, via the S3, SQS, and ClamAV integration. It seems to have stopped working suddenly. Adding: ClamAV version: ClamAV 0.103.6

nidhigwari avatar Jul 21 '22 15:07 nidhigwari

Sorry, we do not provide support for this free/open-source project. Check out our solution bucketAV, which includes professional support: https://bucketav.com

andreaswittig avatar Jul 21 '22 15:07 andreaswittig

Oddly enough, we had this happen to us yesterday as well. Every instance of the s3-virusscan that we had running on a t3.micro suddenly died at the same time. Log inspection led us to find that they all ran out of RAM and the OOM killer killed clamd, but when systemd tried restarting it, it couldn't. I don't know enough about how clamd works when it phones home to get signature updates, but one theory is that it pulled an update yesterday that maxed out all the RAM on the smaller instances. We fixed it just by launching new instances.

rmerrellgr avatar Jul 21 '22 15:07 rmerrellgr
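
If anyone else hits this, a quick way to confirm an OOM kill is to check the kernel log and the clamd unit from the instance itself. A minimal sketch; the unit name clamd@scan is an assumption and may differ depending on the AMI and packaging:

    # Look for OOM killer activity in the kernel log
    dmesg -T | grep -iE "oom|killed process"
    journalctl -k | grep -i oom
    # See whether systemd gave up restarting clamd (unit name is an assumption)
    systemctl status clamd@scan
    journalctl -u clamd@scan --since "1 day ago"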

@nidhigwari Sorry, I was too fast and harsh.

@rmerrellgr Thanks for providing more context.

andreaswittig avatar Jul 21 '22 15:07 andreaswittig

Thanks @rmerrellgr! We have launched new instances, but the service is still not working. We also see a freshclam-related error: "WARNING: FreshClam previously received error code 429 or 403 from the ClamAV Content Delivery Network (CDN). This means that you have been rate limited or blocked by the CDN."

nidhigwari avatar Jul 21 '22 16:07 nidhigwari
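
For anyone debugging the same symptom: one way to confirm that signature updates are being rate limited is to look for 429/403 responses in the freshclam output. A rough sketch; the log path and service name are assumptions and depend on the distro packaging and the UpdateLogFile setting in freshclam.conf:

    # Check the freshclam log for CDN rate-limit responses (path is an assumption)
    grep -E "429|403|blocked" /var/log/clamav/freshclam.log
    # Or, if freshclam runs as a systemd service (unit name is an assumption)
    journalctl -u clamav-freshclam --since "7 days ago" | grep -E "429|403"
    # Run an update by hand and watch the output
    sudo freshclam --verbose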

@nidhigwari ClamAV introduced very strict throttling limits. We have been running into those limits as well and are now hosting our own mirror of the malware database.

andreaswittig avatar Jul 21 '22 19:07 andreaswittig
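
For reference, one common way to run such a mirror is Cisco Talos' cvdupdate tool behind any static web server, with freshclam on the scan instances pointed at it. A rough sketch under those assumptions; the host name, port, and paths below are placeholders:

    # On the mirror host: fetch/refresh the official databases (re-run e.g. hourly via cron)
    pip3 install --user cvdupdate
    cvd update
    # Serve the downloaded databases over HTTP (any web server works; default path is an assumption)
    cd ~/.cvdupdate/database && python3 -m http.server 8080
    # On each scanning instance: point freshclam at the internal mirror
    echo "PrivateMirror http://mirror.internal:8080" | sudo tee -a /etc/freshclam.conf
    sudo freshclam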

@rmerrellgr Is it possible that you tried to scan a "large" S3 object? Did you check the dead-letter queue?

andreaswittig avatar Jul 21 '22 19:07 andreaswittig
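
In case it helps with that check: the dead-letter queue depth can be read with the AWS CLI. A small sketch; the queue URL below is a placeholder for whatever DLQ the stack created:

    # Count messages currently sitting in the dead-letter queue
    aws sqs get-queue-attributes \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/s3-virusscan-dlq \
      --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
    # Peek at one message without removing it
    aws sqs receive-message \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/s3-virusscan-dlq \
      --visibility-timeout 0 --max-number-of-messages 1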

@rmerrellgr what is the value of the SwapSize parameter?

michaelwittig avatar Jul 21 '22 19:07 michaelwittig

@andreaswittig What is the recommendation? How does the customer determine whether the issue is due to throttling? Currently no files are being scanned, and the issue impacts the dev, staging, and prod environments; all appear to have been impacted on the same day.

Please help us understand what changes were made since July 15th so we can determine the best course of action for troubleshooting.

awsnicolemurray avatar Jul 21 '22 21:07 awsnicolemurray

@andreaswittig Nope, no large file scans (no scans at all for some time before the crash, actually). But as I suspected, this is what we found in the logs:

  • Jul 20 11:41:09 clamd[27447]: Database correctly reloaded (8622752 signatures)
  • Jul 20 11:41:11 clamd[27447]: Activating the newly loaded database...
  • Jul 20 11:41:13 kernel: amazon-cloudwat invoked oom-killer:
  • (Followed by 100+ lines of OOM killer output, which ultimately led to clamd being killed)
  • Jul 20 11:41:13 kernel: Killed process 27447 (clamd)
  • Jul 20 11:41:13 systemd: Unit clamd@scan.service entered failed state.
  • Jul 20 11:41:14 systemd: clamd@scan.service holdoff time over, scheduling restart.
  • Jul 20 11:48:14 systemd: clamd@scan.service start operation timed out. Terminating.

At that point it just loops forever trying to start back up, but it can't. I decided it would be easier to just launch replacement instances and be done with it.

I think it's safe to say that this isn't a Widdix problem. We have production-level workloads running on larger instance types, and they did not suffer the same fate. I just found it peculiar that our dev servers died unexpectedly and then someone else reported that theirs did as well. I do not believe any action needs to be taken on your part, however.

And to answer your other question, these t3.micro instances have the SwapSize parameter set to 2 in the CloudFormation config.

rmerrellgr avatar Jul 21 '22 21:07 rmerrellgr
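
A possibly relevant detail, given that the OOM hit right after "Activating the newly loaded database": ClamAV 0.103 reloads its signature database concurrently by default, which keeps two copies in memory during the reload and roughly doubles clamd's RAM usage for a short time. On small instances that behavior can be turned off; a minimal sketch, assuming the RHEL-style clamd@scan unit and /etc/clamd.d/scan.conf layout:

    # Make clamd block during reloads instead of holding two databases in RAM
    # (config path and unit name are assumptions)
    echo "ConcurrentDatabaseReload no" | sudo tee -a /etc/clamd.d/scan.conf
    sudo systemctl restart clamd@scan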

@awsnicolemurray I'd recommend checking the logs.

andreaswittig avatar Jul 22 '22 07:07 andreaswittig
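
For anyone who would rather not SSH into the instances, and assuming the stack's CloudWatch agent ships the relevant system logs to CloudWatch Logs (the log group name below is a placeholder), the same checks can be run from the CLI:

    # Search the shipped system log for OOM killer activity
    aws logs filter-log-events \
      --log-group-name /s3-virusscan/var/log/messages \
      --filter-pattern '"oom-killer"' \
      --start-time $(date -d '2 days ago' +%s000)
    # List the log groups the stack actually created
    aws logs describe-log-groups --query "logGroups[].logGroupName"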

@rmerrellgr Interesting, I haven't observed anything like this before.

andreaswittig avatar Jul 22 '22 07:07 andreaswittig