cloudbeat
[CNVM] Trivy local cache file size increases indefinitely
Describe the bug
Trivy uses a local file (a bbolt db) as a cache in the /tmp directory (/tmp/trivy/fanal/fanal.db), and this file increases in size with each cycle.
As a result, the tmpfs file system holding the /tmp folder fills up (1), and Cloudbeat can no longer download the new trivy db (which it does on each cycle). Cloudbeat then enters a crash loop and stops providing cnvm findings. This could also affect other applications hosted on the same instance that use /tmp for any crucial operation.
(1) /tmp (tmpfs) is a RAM disk with a maximum size, usually half of the host's total RAM.
(Example screenshots of fanal.db size before and after some runs)
[ec2-user ~]$ sudo tree -h /tmp
/tmp
└── [ 60] trivy
    └── [ 60] fanal
        └── [ 1.9G] fanal.db
12 directories, 1 file
[ec2-user ~]$ free -h
       total   used    free   shared  buff/cache  available
Mem:   15Gi    587Mi   8.4Gi  1.9Gi   6.3Gi       12Gi
Swap:  0B      0B      0B
Preconditions
Any cnvm deployment.
To Reproduce
- Create a cnvm deployment with agent version >= 8.12 (older releases still need to be checked).
- Wait for many runs to pass (how many depends on the cloud assets and the host's RAM size).
Expected behavior
Cloudbeat will be able to work indefinitely and produce events on each cycle.
Workaround till the fix
Restarting the host machine deletes everything from /tmp, including fanal.db, so Cloudbeat can continue to work and produce findings.
@orestisfl didn't you also create a ticket for this?
https://github.com/elastic/security-team/issues/8217
I took a look into the ticket https://github.com/elastic/security-team/issues/8217
It seems the root cause is the same, but we just got a different error during the db update flow.
Trivy uses this to specify cache directory: https://github.com/aquasecurity/trivy/blob/d4da83c633a46ad4a61844d8d5502d87b99465a0/pkg/utils/fsutils/fs.go#L23-L29
func defaultCacheDir() string {
	tmpDir, err := os.UserCacheDir()
	if err != nil {
		tmpDir = os.TempDir()
	}
	return filepath.Join(tmpDir, "trivy")
}
In most cases this returns a cache directory on the regular file system (e.g. /root/.cache/trivy if run as root), unless os.UserCacheDir() returns an error, in which case it falls back to /tmp.
On Linux, os.UserCacheDir() returns an error when neither the XDG_CACHE_HOME nor the HOME env var is defined:
https://cs.opensource.google/go/go/+/master:src/os/file.go;l=501-510?q=UserCacheDir&ss=go%2Fgo
default: // Unix
	dir = Getenv("XDG_CACHE_HOME")
	if dir == "" {
		dir = Getenv("HOME")
		if dir == "" {
			return "", errors.New("neither $XDG_CACHE_HOME nor $HOME are defined")
		}
		dir += "/.cache"
	}
In our case, neither of them is defined:
$ sudo cat /proc/$(pidof cloudbeat)/environ | tr '\0' '\n'
PWD=/opt/Elastic/Agent
SYSTEMD_EXEC_PID=2095
LANG=C.UTF-8
INVOCATION_ID=...
SHLVL=0
JOURNAL_STREAM=...
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
AGENT_COMPONENT_ID=cloudbeat/vuln_mgmt_aws-default
AGENT_COMPONENT_TYPE=cloudbeat/vuln_mgmt_aws
When Cloudbeat runs under elastic-agent, it does not inherit all environment variables.
That explains both error logs we had.
The definition of done for https://github.com/elastic/security-team/issues/8217 was to find the root cause of the issue (apart from solving it), and the root cause was found during the investigation that led to this ticket. If there is no objection, I will close it as done, referring to this one.
Verified by checking the fanal.db size over a period of a week:
Measures:
{
"date": "Wed Jul 17 13:49:38 UTC 2024",
"size": "113M"
}
{
"date": "Thu Jul 18 08:49:00 UTC 2024",
"size": "252M"
}
{
"date": "Thu Jul 18 11:08:38 UTC 2024",
"size": "86M"
}
{
"date": "Thu Jul 18 13:08:39 UTC 2024",
"size": "111M"
}
{
"date": "Thu Jul 18 15:08:40 UTC 2024",
"size": "158M"
}
{
"date": "Thu Jul 18 21:36:46 UTC 2024",
"size": "222M"
}
{
"date": "Fri Jul 19 06:52:34 UTC 2024",
"size": "222M"
}
{
"date": "Fri Jul 19 19:06:46 UTC 2024",
"size": "202M"
}
{
"date": "Sat Jul 20 06:33:31 UTC 2024",
"size": "202M"
}
{
"date": "Sun Jul 21 08:32:26 UTC 2024",
"size": "194M"
}
{
"date": "Sun Jul 21 10:32:29 UTC 2024",
"size": "59M"
}
{
"date": "Sun Jul 21 12:32:31 UTC 2024",
"size": "84M"
}
{
"date": "Sun Jul 21 14:32:32 UTC 2024",
"size": "141M"
}
{
"date": "Sun Jul 21 16:32:33 UTC 2024",
"size": "189M"
}
{
"date": "Sun Jul 21 18:32:35 UTC 2024",
"size": "189M"
}
{
"date": "Sun Jul 21 20:32:36 UTC 2024",
"size": "189M"
}
{
"date": "Sun Jul 21 22:32:37 UTC 2024",
"size": "189M"
}
{
"date": "Mon Jul 22 00:32:38 UTC 2024",
"size": "189M"
}
{
"date": "Mon Jul 22 02:32:39 UTC 2024",
"size": "189M"
}
{
"date": "Mon Jul 22 04:32:41 UTC 2024",
"size": "189M"
}
{
"date": "Mon Jul 22 06:32:42 UTC 2024",
"size": "189M"
}
{
"date": "Mon Jul 22 08:32:44 UTC 2024",
"size": "189M"
}
{
"date": "Mon Jul 22 10:32:51 UTC 2024",
"size": "59M"
}
{
"date": "Mon Jul 22 12:32:57 UTC 2024",
"size": "112M"
}
{
"date": "Mon Jul 22 14:33:03 UTC 2024",
"size": "166M"
}
{
"date": "Mon Jul 22 16:33:09 UTC 2024",
"size": "184M"
}