
Elastic agent is not collecting docker container logs with the use of docker dynamic variables

MichaelKatsoulis opened this issue 3 years ago • 6 comments

After a question in the forum on how to collect docker container logs with Elastic Agent, I stumbled across a potential issue.

As we do not have a specific integration to collect docker logs, we rely on custom-logs, which uses the logfile input type.

Leveraging the docker dynamic provider variables and setting the log path to /var/lib/docker/containers/${docker.container.id}/*-json.log, multiple inputs (one per container) are created in the agent as expected. Although the log path on those inputs is correct, agent/filebeat does not collect the logs at all.

Steps to reproduce:

  1. Deploy the Elastic stack on Elastic Cloud with version 8.0.0-alpha
  2. Set up a Kubernetes environment with the docker runtime: minikube start --kubernetes-version=v1.21.3 --container-runtime=docker
  3. Configure the agent.yml inside elastic-agent-standalone-manifest.yaml to collect the docker container logs:
- name: docker-log
  type: logfile
  use_output: default
  meta:
    package:
      name: log
      version: 0.4.6
  data_stream:
    namespace: default
  streams:
    - data_stream:
        dataset: generic
      paths:
        - /var/lib/docker/containers/${docker.container.id}/*-json.log
  4. Deploy the standalone agent with image docker.elastic.co/beats/elastic-agent:8.0.0-SNAPSHOT. Filtering the logs in Kibana with event.dataset:generic, no logs are shown.
  5. Exec into the agent container and run ./elastic-agent -c /etc/agent.yml inspect output -o default | grep docker; we can see one input per docker container. This means that the docker.container.id variable is populated by the docker dynamic provider.
name: docker-log
- /var/lib/docker/containers/06fd6e8770a51b700f2abaa584479ab22e4733953edf30be919a8f00da0c6814/*-json.log
        io_kubernetes_docker_type: container
        maintainer: NGINX Docker Maintainers <[email protected]>
name: docker-log
- /var/lib/docker/containers/0a07c2e613a301b2669e3f90dde2dd454000ea3829644748c8a0a279114dc9b4/*-json.log
        io_kubernetes_docker_type: container
name: docker-log
- /var/lib/docker/containers/0e30f4a7cc42ef16219a108911079fbe1734ffc2c7a19c06c995d173808dc936/*-json.log
        io_kubernetes_docker_type: podsandbox
name: docker-log
- /var/lib/docker/containers/1f58c8b2eefe5889716786910544223a3072493e2eac7676b2aa2460eb199d63/*-json.log
        io_kubernetes_docker_type: container
  6. Although those log paths are correct and full of logs, filebeat does not collect the logs or log anything suspicious.
  7. If we update the agent.yml configmap, removing the dynamic variable, and restart the agent, then logs are collected:
- name: docker-log
  type: logfile
  use_output: default
  meta:
    package:
      name: log
      version: 0.4.6
  data_stream:
    namespace: default
  streams:
    - data_stream:
        dataset: generic
      paths:
        - /var/lib/docker/containers/*/*-json.log

This is very weird. It seems that either the logfile input is not working as expected (although I tested the same with the filestream input and it did not fix the problem), or the agent-filebeat communication may have a problem. Or something different.

Also, the documentation of the docker dynamic provider is outdated. Variables like docker.id do not exist; instead the provider populates docker.container.id. We can see the mapping it creates per event here.
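
For reference, a rough sketch of the per-container mapping the provider exposes, based on the inspect output above. Only docker.container.id is confirmed by this issue; the other field names are assumptions to be checked against the provider source:

docker:
  container:
    id: "06fd6e8770a51b700f2abaa584479ab22e4733953edf30be919a8f00da0c6814"  # confirmed working in the repro above
    name: "nginx"           # assumed field name
    image: "nginx:latest"   # assumed field name
    labels:                 # assumed field name
      io_kubernetes_docker_type: "container"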

MichaelKatsoulis avatar Nov 18 '21 10:11 MichaelKatsoulis

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

elasticmachine avatar Nov 18 '21 15:11 elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Nov 18 '21 15:11 elasticmachine

@faec did you make any progress on this or should we move it to another release? cc @cmacknz

jlind23 avatar Apr 25 '22 06:04 jlind23

@faec @cmacknz what shall we do here?

jlind23 avatar May 17 '22 12:05 jlind23

I don't like that this doesn't work properly, but it does seem from the description that there is a workaround for the underlying bug: changing the path from /var/lib/docker/containers/${docker.container.id}/*-json.log to /var/lib/docker/containers/*/*-json.log.

There are no other open issues for this that I could find. I think we could move it to the backlog if @faec isn't close to fixing it.

cmacknz avatar May 17 '22 15:05 cmacknz

I was just trying to figure out why I couldn't find any logs in Kibana for my containers when I found this issue.

Am I reading things correctly in that the official way to use Elastic-Agent to ingest Docker container logs is a custom log integration pointed at /var/lib/docker/containers/*/*-json.log?

jerrac avatar Aug 11 '22 22:08 jerrac

I'm currently trying to find ways to make this work out of the box with a pre-defined agent policy, and I see the following example in the documentation:

inputs:
  - type: logfile
    path: "${docker.paths.log}"

Do I understand correctly that this doesn't work either at the moment? (since OP used an explicit path instead)

antoineco avatar Sep 29 '22 13:09 antoineco

I've been looking into this issue and one thing is not clear to me: is the Elastic-Agent running as a Kubernetes Pod or directly as a docker container? @MichaelKatsoulis I see you used minikube; was it just to have some containers running, or are you deploying the Elastic-Agent on this minikube cluster? That impacts a lot how Elastic-Agent is deployed and how it will access the logs.

Based on the title of the issue, I'll assume Elastic-Agent is deployed directly as a docker container; this means its container will need some special permissions and mounts to access the docker logs (which are stored on the host). That explains why the correct configuration is generated but no logs are sent to Elasticsearch.

The way I'm testing it:

  1. Deploy Elastic Cloud with the latest version (8.4.2)
  2. Start a container that will generate logs (e.g.: docker run --rm --name flog -d mingrammer/flog /bin/flog -s1 -d1 -l -f rfc3164).
  3. Create a standalone Elastic-Agent policy using Kibana as suggested in the documentation.
  4. On this policy add the "Custom Logs Integration", configuring the path as /var/lib/docker/containers/${docker.container.id}/*-json.log.
  5. Download the policy (e.g. to /tmp/elastic-agent.yml) and adjust the Elasticsearch credentials as needed.
  6. Start the Elastic-Agent:
    docker run \
     --rm \
     --name=ea \
     --volume="/tmp/elastic-agent.yml:/usr/share/elastic-agent/elastic-agent.yml"  \
     --volume="/var/lib/docker/containers:/var/lib/docker/containers:ro" \
     --volume="/var/run/docker.sock:/var/run/docker.sock:ro" \
     --user=root \
     docker.elastic.co/beats/elastic-agent:8.4.2 \
     /usr/share/elastic-agent/elastic-agent container -e -d"*" -v
    

This will mount all necessary files/folders into the container so Filebeat can harvest the logs. It will also run the Elastic-Agent with debug logs enabled.

With that setup I validated that the Elastic-Agent/Filebeat can access the necessary files to ingest container logs. However, now I'm facing a different issue that still does not allow the logs to be ingested: it's an issue with some processor configuration:

{"log.level":"error","@timestamp":"2022-10-04T17:46:31.890Z","log.origin":{"file.name":"log/reporter.go","file.line":36},"message":"2022-10-04T17:46:31Z - message: Application: filebeat--8.4.2[779f1a28-89a4-43ff-bdf7-65c1100dc9ed]: State changed to FAILED: 1 error occurred:\n\t* 2 errors: Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields\n\n - type: 'ERROR' - sub_type: 'FAILED'","ecs.version":"1.6.0"}

I'm still investigating and I'll update here once I have more details. However, mounting those volumes might solve the problem for some of you.

belimawr avatar Oct 04 '22 18:10 belimawr

In my case Elastic Agent is NOT in a Docker container. It's just running straight on my VMs. Managed via Fleet.

I just tested again. When using /var/lib/docker/containers/${docker.container.id}/*-json.log in my Custom Logs integration, I do not get any Docker logs. When I use /var/lib/docker/containers/*/*-json.log I do get Docker logs.

This is Elastic Agent 8.4.0 on Ubuntu 20.04.

jerrac avatar Oct 04 '22 19:10 jerrac

In my case Elastic Agent is NOT in a Docker container. It's just running straight on my VMs. Managed via Fleet.

I just tested again. When using /var/lib/docker/containers/${docker.container.id}/*-json.log in my Custom Logs integration, I do not get any Docker logs. When I use /var/lib/docker/containers/*/*-json.log I do get Docker logs.

This is Elastic Agent 8.4.0 on Ubuntu 20.04.

Can your Elastic-Agent access the docker API? The default connection is via the socket (/var/run/docker.sock); if it cannot access the Docker API, it won't be able to resolve ${docker.container.id} and it will fail silently.

At startup the Elastic-Agent will log if it connected to the Docker API. On failure:

{"log.level":"info","@timestamp":"2022-10-05T10:06:23.743Z","log.logger":"composable.providers.docker","log.origin":{"file.name":"docker/docker.go","file.line":44},"message":"Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","ecs.version":"1.6.0"}

On success:

{"log.level":"debug","@timestamp":"2022-10-05T10:13:15.272Z","log.logger":"docker","log.origin":{"file.name":"docker/client.go","file.line":49},"message":"Docker client will negotiate the API version on the first request.","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-10-05T10:13:15.276Z","log.logger":"composable.providers.docker","log.origin":{"file.name":"docker/watcher.go","file.line":213},"message":"Start docker containers scanner","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-10-05T10:13:15.276Z","log.logger":"composable.providers.docker","log.origin":{"file.name":"docker/watcher.go","file.line":375},"message":"List containers","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-10-05T10:13:15.277Z","log.logger":"composable.providers.docker.bus-docker","log.origin":{"file.name":"bus/bus.go","file.line":94},"message":"map[container:0xc00042e070 start:true]","elastic-agent-autodiscover.bus":"docker","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-10-05T10:13:15.277Z","log.logger":"composable.providers.docker.bus-docker","log.origin":{"file.name":"bus/bus.go","file.line":94},"message":"map[container:0xc00042e0e0 start:true]","elastic-agent-autodiscover.bus":"docker","ecs.version":"1.6.0"}

Those logs are issued at startup.
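
If the daemon is listening somewhere other than the default socket, the provider can likely be pointed at it in a standalone policy. A sketch assuming the docker provider supports a host setting (check the provider reference for your version):

providers:
  docker:
    host: "unix:///var/run/docker.sock"  # assumed setting name; e.g. tcp://localhost:2375 for a TCP daemon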

Do you see any errors on Elastic-Agent or Filebeat logs?

belimawr avatar Oct 05 '22 10:10 belimawr

I did find an issue that makes Filebeat not start inputs and stay in an unhealthy state: https://github.com/elastic/elastic-agent/pull/1420.

So far nobody has mentioned errors on Filebeat or it being in an unhealthy state, so I'm not sure it is the root cause of the original issue described.

belimawr avatar Oct 05 '22 17:10 belimawr

I might have been seeing something like that. I can't test it right now, but did you see messages like:

Error decoding JSON: json: cannot unmarshal number into Go value of type map[string]interface {}

in the elastic.filebeat dataset.

jerrac avatar Oct 05 '22 18:10 jerrac

Re https://github.com/elastic/beats/issues/29030#issuecomment-1268242645

root@swarmworker01:/opt/Elastic/Agent# grep -r 'composable.providers.docker' *
Binary file data/elastic-agent-56a002/elastic-agent matches
Binary file elastic-agent matches

And that is after running an elastic-agent restart so that new logs would show up.

I do see metrics for my containers. Those get pulled from unix:///var/run/docker.sock, right? I don't see why Filebeat wouldn't be able to access unix:///var/run/docker.sock if Metricbeat can.
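
A quick way to confirm the socket itself is reachable from where the agent runs (assuming curl is available) is to query the Docker Engine API over it:

# Prints the daemon version as JSON if the socket is accessible
curl --unix-socket /var/run/docker.sock http://localhost/version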

Something I did notice last time I tested the variable /var/lib/docker/containers/${docker.container.id}/*-json.log path, and was able to trigger just now, is that when I use the variable path, my Agents go into "Unhealthy" status, and I see messages like this in the Agent logs:

14:50:27.292 elastic_agent [elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
14:50:27.297 elastic_agent [elastic_agent][info] New State ID is TNQBVzyT
14:50:27.297 elastic_agent [elastic_agent][info] Converging state requires execution of 3 step(s)
14:50:27.854 elastic_agent [elastic_agent][info] operation 'operation-install' skipped for filebeat.8.4.0
14:50:27.854 elastic_agent [elastic_agent][info] operation 'operation-start' skipped for filebeat.8.4.0
14:50:27.931 elastic_agent [elastic_agent][error] Elastic Agent status changed to "error": "app filebeat--8.4.0-826a3342: 1 error occurred:\n\t* 6 errors: Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields\n\n"
14:50:27.931 elastic_agent [elastic_agent][error] 2022-10-05T14:50:27-07:00 - message: Application: filebeat--8.4.0[fedf67ba-8aa9-45ce-b779-9547bd1169a3]: State changed to FAILED: 1 error occurred:
	* 6 errors: Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields

 - type: 'ERROR' - sub_type: 'FAILED'
14:50:28.585 elastic_agent [elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.4.0
14:50:28.585 elastic_agent [elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.4.0
14:50:29.183 elastic_agent [elastic_agent][info] operation 'operation-install' skipped for filebeat.8.4.0
14:50:29.183 elastic_agent [elastic_agent][info] operation 'operation-start' skipped for filebeat.8.4.0
14:50:29.898 elastic_agent [elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.4.0
14:50:29.898 elastic_agent [elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.4.0
14:50:29.901 elastic_agent [elastic_agent][info] Updating internal state

Which is what you mentioned in https://github.com/elastic/beats/issues/29030#issuecomment-1267383251

jerrac avatar Oct 05 '22 21:10 jerrac

@jerrac I do not believe 'composable.providers.docker' would appear in the logs at all. I don't get why you ran grep -r 'composable.providers.docker' *. Anyway, it's not important.

I do see metrics for my containers. Those get pulled from unix:///var/run/docker.sock, right? I don't see why Filebeat wouldn't be able to access unix:///var/run/docker.sock if Metricbeat can.

Yes, if Metricbeat can access it, then Filebeat can as well. That rules out Filebeat not having access to the Docker API, which is great!

Something I did notice last time I tested the variable /var/lib/docker/containers/${docker.container.id}/*-json.log path, and was able to trigger just now, is that when I use the variable path, my Agents go into "Unhealthy" status, and I see messages like this in the Agent logs:

Yep, that's exactly the issue I'm facing and solving with https://github.com/elastic/elastic-agent/pull/1420. I believe it to be the root cause of not being able to use variables like ${docker.*} on Elastic-Agent.

belimawr avatar Oct 06 '22 09:10 belimawr

Backport to 7.17.x: https://github.com/elastic/beats/pull/33269

belimawr avatar Oct 06 '22 09:10 belimawr

I might have been seeing something like that. I can't test it right now, but did you see messages like:

Error decoding JSON: json: cannot unmarshal number into Go value of type map[string]interface {}

in the elastic.filebeat dataset.

Btw, that's a totally different error and seems to be related to decoding logs as JSON. This also means the log was ingested by Filebeat and sent to Elasticsearch, but the JSON parsing failed. Depending on the input and configuration you're using, the original message is also in the event.
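
If the goal is to decode the container JSON cleanly rather than surface such errors, the log input's JSON options can be set in the custom configuration of the integration; a sketch using the documented log-input options (whether they apply depends on the input actually in use):

json.keys_under_root: false   # keep decoded fields under a "json" object
json.add_error_key: true      # flag events where JSON decoding fails
json.message_key: log         # docker json-file entries keep the original line under "log"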

belimawr avatar Oct 06 '22 14:10 belimawr

I've been looking into this issue and one thing is not clear to me: is the Elastic-Agent running as a Kubernetes Pod or directly as a docker container? @MichaelKatsoulis I see you used minikube; was it just to have some containers running, or are you deploying the Elastic-Agent on this minikube cluster? That impacts a lot how Elastic-Agent is deployed and how it will access the logs.

@belimawr the agent is deployed as a Kubernetes pod, but underneath Docker is running as the container runtime. The agent pod of course has access to the pods and to Docker as well. I think the root cause of the problem is the add_fields processors failing, as you have already noticed.

MichaelKatsoulis avatar Oct 06 '22 14:10 MichaelKatsoulis

@MichaelKatsoulis I believe you used the manifest file provided in our documentation, adding the input configuration you mentioned, is that it?

I'll quickly test this case just to be on the safe side

belimawr avatar Oct 06 '22 16:10 belimawr

I've just tested and my Elastic-Agent does not get access to the Docker API:

{"log.level":"info","@timestamp":"2022-10-06T16:24:53.361Z","log.logger":"composable.providers.docker","log.origin":{"file.name":"docker/docker.go","file.line":44},"message":"Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","ecs.version":"1.6.0"}

Looking at the manifest file, I see the mount for /var/lib/docker/containers:

            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true

But there is no mount for /var/run/docker.sock, hence the docker provider is not initialised and the variables cannot be resolved.
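
For anyone hitting the same problem, this is roughly the extra mount and volume that would need to be added to the agent DaemonSet in that manifest (the volume name here is illustrative):

# added under the elastic-agent container's volumeMounts:
- name: dockersock
  mountPath: /var/run/docker.sock
  readOnly: true
# added under the Pod spec's volumes:
- name: dockersock
  hostPath:
    path: /var/run/docker.sock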

It's quite interesting that it seems to work for you.

When I added the mount for /var/run/docker.sock, I got the same add_fields processor error. So I believe that's the root cause (given all access is working correctly).

belimawr avatar Oct 06 '22 16:10 belimawr

@belimawr I used that manifest but I added the docker.sock mount as well.

MichaelKatsoulis avatar Oct 10 '22 14:10 MichaelKatsoulis

@belimawr I used that manifest but I added the docker.sock mount as well.

Now it all makes sense, thanks for the clarification!

belimawr avatar Oct 10 '22 17:10 belimawr