
docker_logs source: infinite loop starting/stopping watching for container logs

ashtuchkin opened this issue 3 years ago (status: Open)

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Vector Version

Docker image: timberio/vector:0.19.0-debian

Vector Configuration File

sources:
  docker_logs_source:
    type: docker_logs
    exclude_containers:
      - metrics_vector
  # .. some other sources.

transforms:
  docker_logs:
    type: remap
    inputs: [docker_logs_source]
    reroute_dropped: true
    drop_on_error: true
    source: |
      # ... some logic that sometimes fails (see logs below)

sinks:
  loki:
    type: loki
    inputs: [docker_logs, journald_logs, docker_events_logs, dropped_logs]
    endpoint: "https://$${LOKI_HOST}"
    auth:
      strategy: basic
      user: "$${LOKI_USER}"
      password: "$${LOKI_PASSWORD}"
    encoding:
      codec: text
    labels: ...

Debug Output

I don't have it for now; I'll add it when I reproduce the issue again.

Expected Behavior

Vector should run quietly and forward logs for all docker containers.

Actual Behavior

Vector spammed its own logs for a day, pegging the CPU at 100%, until the host ran out of disk space.

Logs

{"log":"2022-01-23T22:22:46.023552Z  INFO vector::app: Log level is enabled. level=\"vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info\"
\n","stream":"stderr","time":"2022-01-23T22:22:46.028287Z"}
{"log":"2022-01-23T22:22:46.026980Z  INFO vector::app: Loading configs. paths=[\"/dev/shm/vector.yml\"]\n","stream":"stderr","time":"2022-01-23T22:22:46.02833532Z"}
{"log":"2022-01-23T22:22:46.116512Z  INFO vector::sources::docker_logs: Capturing logs from now on. now=2022-01-23T22:22:46.109592143+00:00\n","stream":"stderr","time":"2022-01-23
T22:22:46.119402013Z"}
{"log":"2022-01-23T22:22:46.125526Z  INFO vector::sources::docker_logs: Listening to docker log events.\n","stream":"stderr","time":"2022-01-23T22:22:46.127620029Z"}
{"log":"2022-01-23T22:22:46.419966Z  INFO vector::topology::running: Running healthchecks.\n","stream":"stderr","time":"2022-01-23T22:22:46.420566068Z"}
{"log":"2022-01-23T22:22:46.424382Z  INFO vector::topology::builder: Healthcheck: Passed.\n","stream":"stderr","time":"2022-01-23T22:22:46.425591243Z"}
{"log":"2022-01-23T22:22:46.436799Z  INFO vector::topology::running: Starting source. key=docker_events_source\n","stream":"stderr","time":"2022-01-23T22:22:46.437846272Z"}
{"log":"2022-01-23T22:22:46.437531Z  INFO vector::topology::running: Starting source. key=journald_source\n","stream":"stderr","time":"2022-01-23T22:22:46.437887602Z"}
{"log":"2022-01-23T22:22:46.437555Z  INFO vector::topology::running: Starting source. key=internal_metrics\n","stream":"stderr","time":"2022-01-23T22:22:46.437890972Z"}
{"log":"2022-01-23T22:22:46.437565Z  INFO vector::topology::running: Starting source. key=docker_logs_source\n","stream":"stderr","time":"2022-01-23T22:22:46.437893642Z"}
{"log":"2022-01-23T22:22:46.437579Z  INFO vector::topology::running: Starting transform. key=journald_logs\n","stream":"stderr","time":"2022-01-23T22:22:46.437896222Z"}
{"log":"2022-01-23T22:22:46.437587Z  INFO vector::topology::running: Starting transform. key=dropped_logs\n","stream":"stderr","time":"2022-01-23T22:22:46.437898762Z"}
{"log":"2022-01-23T22:22:46.437595Z  INFO vector::topology::running: Starting transform. key=docker_logs\n","stream":"stderr","time":"2022-01-23T22:22:46.437901282Z"}
{"log":"2022-01-23T22:22:46.437606Z  INFO vector::topology::running: Starting transform. key=docker_events_logs\n","stream":"stderr","time":"2022-01-23T22:22:46.437903722Z"}
{"log":"2022-01-23T22:22:46.437615Z  INFO vector::topology::running: Starting sink. key=loki\n","stream":"stderr","time":"2022-01-23T22:22:46.437906192Z"}
{"log":"2022-01-23T22:22:46.437623Z  INFO vector::topology::running: Starting sink. key=prometheus\n","stream":"stderr","time":"2022-01-23T22:22:46.437908672Z"}
{"log":"2022-01-23T22:22:46.445834Z  INFO vector: Vector has started. debug=\"false\" version=\"0.19.0\" arch=\"x86_64\" build_id=\"da60b55 2021-12-28\"\n","stream":"stderr","time
":"2022-01-23T22:22:46.447542333Z"}
{"log":"2022-01-23T22:22:46.445891Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.\n","stream":"stderr","time":"
2022-01-23T22:22:46.447606733Z"}
{"log":"2022-01-23T22:22:46.594277Z ERROR vector::topology::builder: msg=\"Healthcheck: Failed Reason.\" error=Failed to make HTTP(S) request: error trying to connect: tcp connect
 error: Connection refused (os error 111) component_kind=\"sink\" component_type=\"loki\" component_id=loki component_name=loki\n","stream":"stderr","time":"2022-01-23T22:22:46.59
6227004Z"}
{"log":"2022-01-23T22:22:46.630346Z  INFO source{component_kind=\"source\" component_id=docker_logs_source component_type=docker_logs component_name=docker_logs_source}: vector::s
ources::docker_logs: Excluded self container. id=ef5c053920611942a23064e98c3eae40da8e789ea468fb3a78674abd7b56c67f\n","stream":"stderr","time":"2022-01-23T22:22:46.63406848Z"}
{"log":"2022-01-23T22:22:46.637856Z  INFO source{component_kind=\"source\" component_id=journald_source component_type=journald component_name=journald_source}: vector::sources::j
ournald: Starting journalctl.\n","stream":"stderr","time":"2022-01-23T22:22:46.638571309Z"}
{"log":"2022-01-23T22:22:46.709498Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=1833d23c021cf1bd477c2d7ab4f7a1f7063e0451978d4fbfe
657777d5570976c\n","stream":"stderr","time":"2022-01-23T22:22:46.718305045Z"}
{"log":"2022-01-23T22:22:46.709852Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=61d315e39a4fafe3d08dd5809fa12b61542e99218b183b471
611685bb3d57c30\n","stream":"stderr","time":"2022-01-23T22:22:46.718314695Z"}
{"log":"2022-01-23T22:22:46.721271Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=3a090710ac87009def0bf4b3aa19564196b2224141152a81d
097ff2852dd3248\n","stream":"stderr","time":"2022-01-23T22:22:46.726190684Z"}
{"log":"2022-01-23T22:22:46.721653Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=225fbb7379626aae8e1b0ca6efee3480917c90021a5e16840
ccbced14a60fdf6\n","stream":"stderr","time":"2022-01-23T22:22:46.726209294Z"}
{"log":"2022-01-23T22:22:46.721807Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=e2bfb07c88ff3fdb205178a8f5f2fcb0eda9160eef3eda3e3
758a739044a5212\n","stream":"stderr","time":"2022-01-23T22:22:46.726212564Z"}
{"log":"2022-01-23T22:22:46.721952Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=0874c0c3d3ee63ee9260508539ceb0e89a2f56678ed9048a784dbb0e41d76aa6\n","stream":"stderr","time":"2022-01-23T22:22:46.726215864Z"}
{"log":"2022-01-23T22:22:46.722221Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=1a3d8ba3e2afb09609e76f5d90e2991ef1d56d92158252c2034e5c6e6891423e\n","stream":"stderr","time":"2022-01-23T22:22:46.726218584Z"}
{"log":"2022-01-23T22:22:46.765452Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=da78600f78df43fc24168bda3448827ec7399077a86e4df6daca31f04f340729\n","stream":"stderr","time":"2022-01-23T22:22:46.766445688Z"}
{"log":"2022-01-23T22:22:46.914142Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=c1c36aaaec94c9a4852898d869ee3fbb1e08d73a5ff6df5e8f45ca43b40ea4e6\n","stream":"stderr","time":"2022-01-23T22:22:46.914282306Z"}
{"log":"2022-01-23T22:22:46.928339Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=cf87abf59a3cf52f4a8c539453484a57b03f085a7e122fced4109490109c29ee\n","stream":"stderr","time":"2022-01-23T22:22:46.92928347Z"}
{"log":"2022-01-23T22:22:46.981371Z  WARN vector::internal_events::remap: Mapping failed with event; discarding event. error=\"function call error for \\\"format_timestamp\\\" at (816:868): function call error for \\\"to_timestamp\\\" at (834:853): unable to coerce \\\"null\\\" into \\\"timestamp\\\"\" internal_log_rate_secs=30\n","stream":"stderr","time":"2022-01-23T22:22:46.994530507Z"}
{"log":"2022-01-23T22:22:47.016372Z  WARN vector::internal_events::remap: Internal log [Mapping failed with event; discarding event.] is being rate limited.\n","stream":"stderr","time":"2022-01-23T22:22:47.016547917Z"}
{"log":"2022-01-23T22:22:47.221985Z  INFO vector::internal_events::docker_logs: Stopped watching for container logs. container_id=c1c36aaaec94c9a4852898d869ee3fbb1e08d73a5ff6df5e8f45ca43b40ea4e6\n","stream":"stderr","time":"2022-01-23T22:22:47.223667286Z"}
{"log":"2022-01-23T22:22:47.223572Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=c1c36aaaec94c9a4852898d869ee3fbb1e08d73a5ff6df5e8f45ca43b40ea4e6\n","stream":"stderr","time":"2022-01-23T22:22:47.231636204Z"}
{"log":"2022-01-23T22:22:47.270906Z  INFO vector::internal_events::docker_logs: Stopped watching for container logs. container_id=c1c36aaaec94c9a4852898d869ee3fbb1e08d73a5ff6df5e8f45ca43b40ea4e6\n","stream":"stderr","time":"2022-01-23T22:22:47.271970959Z"}
{"log":"2022-01-23T22:22:47.274423Z  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=c1c36aaaec94c9a4852898d869ee3fbb1e08d73a5ff6df5e8f45ca43b40ea4e6\n","stream":"stderr","time":"2022-01-23T22:22:47.274757922Z"}

... and then 19 GB of repetitions of the last two lines, until I ran out of space.

Additional Context

Running Vector in a container on a Docker Swarm node. No k8s.

References

Seems related:

  • https://github.com/vectordotdev/vector/issues/7272#issuecomment-874982388
  • https://github.com/vectordotdev/vector/issues/5937#issuecomment-768177907

ashtuchkin (Jan 27, 2022)

Update: I ran into the same situation again today. After a small investigation, here's what I think happened:

  1. Another process on the same host tried to allocate more memory than was available. Swap was disabled, so there was immense memory pressure on the system.
  2. While there were no OOM kills, the docker daemon did not handle that well: it printed stack traces of all its goroutines to syslog and started a swarm re-election.
  3. At that point, the event announcing that one of the containers had died was dropped, so Vector never received it.
  4. Vector therefore still thinks the container is alive and tries to fetch its logs. Docker returns nothing, Vector immediately tries again, cue infinite loop.

I've added swap to the host, so the immediate problem should go away. Still, it would be nice for Vector to be more resilient to this kind of problem, since the failure mode pegs the CPU at 100%, fills up the disk, and effectively locks up the whole machine.
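To make the ask concrete, here is a minimal sketch of a bounded retry loop with exponential backoff. This is not Vector's actual code: Poll and poll_container_logs are invented stand-ins for the Docker API interaction, used only to illustrate the shape of the fix being requested.

use std::thread::sleep;
use std::time::Duration;

// Hypothetical result of one attempt to read a container's log stream.
enum Poll {
    Events(usize), // number of log events received
    Gone,          // the daemon says the container no longer exists
    Error,         // transient failure (dropped event, corrupt or over-sized log message, ...)
}

// Stand-in for "ask the Docker daemon for this container's logs".
fn poll_container_logs(_container_id: &str) -> Poll {
    Poll::Error
}

// Watch one container, but back off exponentially on repeated failures and
// give up after `max_retries` instead of re-subscribing in a tight loop.
fn watch_container(container_id: &str, max_retries: u32) {
    let mut attempt: u32 = 0;
    loop {
        match poll_container_logs(container_id) {
            Poll::Events(_) => attempt = 0, // a healthy read resets the backoff
            Poll::Gone => {
                println!("container {container_id} is gone, stop watching");
                return;
            }
            Poll::Error => {
                attempt += 1;
                if attempt > max_retries {
                    println!("giving up on {container_id} after {max_retries} failed retries");
                    return;
                }
                // 2s, 4s, 8s, ... capped at 60s between attempts.
                let secs = (1u64 << attempt.min(6)).min(60);
                println!("retry {attempt} for {container_id} in {secs}s");
                sleep(Duration::from_secs(secs));
            }
        }
    }
}

fn main() {
    // With the stubbed poll above this fails every time, backs off, and eventually stops.
    watch_container("c1c36aaaec94", 5);
}

The point is simply that repeated failures back off instead of spinning, and that a retry cap (or treating "container gone" as terminal) makes the failure self-limiting.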

ashtuchkin (Feb 6, 2022)

I have one container (redis) for which Vector reports the same issue. If I execute docker logs <container-id>, it outputs a lot of logs and then, at some point, this:

1:M 26 Aug 2021 13:20:04.076 * 100 changes in 300 seconds. Saving...
1:M 26 Aug 2021 13:20:04.077 * Background saving started by pid 2104
2104:C 26 Aug 2021 13:20:04.522 * DB saved on disk
2104:C 26 Aug 2021 13:20:04.525 * RDB: 0 MB of memory used by copy-on-write
1:M 26 Aug 2021 13:20:04.581 * Background saving terminated with success
error from daemon in stream: Error grabbing logs: log message is too large (6294022 > 1000000)

marcbachmann (Aug 22, 2022)

I'm getting the same issue. You can reproduce it by exceeding the 1 MB log-message limit in a container on some Linux distributions (OpenSUSE 15.4 with docker 23.0.6-ce, for example).

It is worth noting that this doesn't seem to happen on Ubuntu 22.04: even though you get the same error from docker, Vector does not retry over and over.

The retry backoff setting doesn't seem to affect this behavior in any way.

Another observation on the OpenSUSE system: docker might crash if I ask it for logs too often; the query can be as simple as docker logs --tail=10.

Moreover, if you try to work around the issue by limiting the logs, the docker logging driver settings don't seem to help. For instance, if you set:

"driver": "local",
"max-file": "1",
"max-size": "1m",
"compress": "false"

it will still grow indefinitely.
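For reference, these options are normally set as log-opts of the json-file or local logging driver. In /etc/docker/daemon.json that would look roughly like the snippet below; this is an assumption about where the settings above were applied, and per the comment above even this did not stop the growth.

{
  "log-driver": "local",
  "log-opts": {
    "max-size": "1m",
    "max-file": "1",
    "compress": "false"
  }
}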

So I'm not aware of any proper workaround other than recreating the containers once in a while, truncating the log file manually, or using some other log driver to bypass docker's own log files.

Anyway, Vector shouldn't stress the system this much. If we could have a setting that limits the retries, that would be great.

tonybart1337 (Aug 15, 2023)

I've been encountering this issue on a few Ubuntu 22.04 servers over the past couple of weeks. In my case the cause was corruption in a few of the docker log files. Running docker logs <container> would display this error:

error from daemon in stream: Error grabbing logs: invalid character 'l' after object key:value pair

Other times it would be a \x00 error from the docker command.

I've observed a couple of types of corruption in the logs, the most prominent being embedded \x00 bytes. You can check whether any of your logs contain them with a simple grep:

grep -Pa '\x00' /var/lib/docker/containers/**/*json.log

If there are any matches, they can be removed:

find /var/lib/docker/containers/ -type f -name '*json.log' -exec sed -i '/\x00/d' {} +

If that doesn't work, you can try truncating the log file for the container (or all containers), as in the example below. More info can be found at https://github.com/docker/for-linux/issues/140#issuecomment-1501825729.
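A common way to do the truncation, assuming the default json-file log path (the <container-id> here is a placeholder, not from this thread):

sudo truncate -s 0 /var/lib/docker/containers/<container-id>/<container-id>-json.log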

A setting that limits the retries would help the system until the logs can be fixed. Vector infinitely retries watching/unwatching the container logs as fast as it can, which pegs the CPU near 100% until the corruption is fixed.

Zibbp (Sep 3, 2023)

I can reproduce this on Ubuntu 22.04 by creating some docker logs using flog:

docker run -it --rm mingrammer/flog -l -d 10

Then manually edit the log file:

sudo vi docker/containers/da8f9c995531de72c9e7450e360d11bdee31d3687f53147fc228ad151a9e9801/da8f9c995531de72c9e7450e360d11bdee31d3687f53147fc228ad151a9e9801-json.log

Corrupt the JSON in some way and save the file.
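One possible way to introduce such corruption without an editor, assuming the default json-file path (<container-id> is a placeholder): append any invalid JSON line, or a line containing a NUL byte.

echo 'not valid json' | sudo tee -a /var/lib/docker/containers/<container-id>/<container-id>-json.log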

Then run vector with this source:

sources:
  in:
    type: docker_logs
    retry_backoff_secs: 10

Vector goes crazy. Note that retry_backoff_secs has no effect.

Obviously docker shouldn't be corrupting the log files, but I guess it sometimes does. Vector should be able to handle this more gracefully when it happens.

StephenWakely (Sep 12, 2023)

I have the same issue; Vector pegs the CPU at 100%.

1xtr (Jun 1, 2024)

I think the fix has arrived: https://github.com/moby/moby/pull/48842. It looks like this should permanently fix the issue, since docker will no longer return an error.

tonybart1337 (Jan 13, 2025)