If Loki is not reachable and the loki-docker-driver is activated, container apps stop and cannot be stopped/killed
Describe the bug
We have installed the loki-docker-driver on all our devices, and the Loki server on a separate server. If the loki-server is updated/restarted or otherwise not reachable, then after a short time all containers get stuck (docker logs no longer updates). While the loki-server is not reachable, the containers can neither be stopped/killed nor restarted.
To Reproduce
Steps to reproduce the behavior:
- start the loki server (server)
- install the loki-docker-driver on another system (can also be tested on one and the same system) (client) and configure `/etc/docker/daemon.json`:

```json
{
  "live-restore": true,
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "http://loki:3100/api/prom/push",
    "mode": "non-blocking",
    "loki-batch-size": "400",
    "max-size": "1g"
  }
}
```

- run a test container (client):

```
docker run --rm --name der-container -d debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

- `docker exec -it der-container tail -f /tmp/ts` shows the time every second (client)
- `docker logs -f der-container` shows numbers from 0-1000000 (client)
- stop the loki server (server)
- you will see that the output via the loki driver stops and that you cannot stop the container (client):

```
docker stop der-container
```
Expected behavior
I would like all containers to continue to run as desired even if Loki is not accessible, and containers should be able to start/stop even if Loki is not reachable.
Environment:
- Infrastructure: [bare-metal, laptop, VMs]
- Deployment tool: [docker-compose]
Screenshots, Promtail config, or terminal output
loki-docker-driver version: loki-docker-driver:master-616771a (from that version on, the driver option "non-blocking" is supported)
Loki server: 1.5.0
I am very grateful for any help; this problem has caused our whole system to collapse.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
:-(
This issue is being closed without any comment/feedback?
For me/us this is a major issue/blocker.
@owen-d Can you please comment? Thank you!
#2017 fixed the same problem for me
> #2017 fixed the same problem for me
Do you mean setting the non-blocking mode? The OP stated that they set the mode to non-blocking but it still does not work. I'll have to try it tomorrow.
I could reproduce the problem:

```
root@loki # docker run -d --log-driver=loki --log-opt loki-url="http://172.29.95.195:3101/loki/api/v1/push" --log-opt loki-retries=5 --log-opt loki-batch-size=400 --log-opt mode=non-blocking --name der-container -d debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```
Running loki and the above client-container, then stopping loki, the client-container fails:

```
error from daemon in stream: Error grabbing logs: error decoding log message: net/http: request canceled (Client.Timeout exceeded while reading body)
```
> #2017 fixed the same problem for me
>
> Do you mean setting the non-blocking mode? The OP stated that they set the mode to non-blocking but it still does not work. I'll have to try it tomorrow.
Yeah, I meant the non-blocking mode; I hadn't noticed it in the original issue, sorry.
No response? 😢
Hi, we are testing Loki for our architecture, and I encountered this issue too.
I found that stopping a container (any container) incurs a "penalty" of between 5 and 15 minutes when loki is the logging driver and the destination server (either Loki or promtail) is unreachable. In our testing architecture, the Docker log driver pushes the logs to the promtail container, and promtail pushes the logs to the Loki server (I thought promtail would cache them, so it seemed like a good idea):
```
+-----------------------+ +--------------------+
| Virtual Machine 01    | | Virtual Machine 02 |
|                       | |                    |
| +------+--------+     | |                    |
| |Loki  | Docker |     | |                    |
| |DRIVER|        |     | |                    |
| +-+---++        |     | |                    |
|   |   ^         |     | |       +--------+   |
|   |   | +-v------+    | |       | Loki   |   |
|   |   | |Promtail+------------->+ Server |   |
|   |   | +--------+    | |       |        |   |
|   |   |               | |       +--------+   |
|   |  +-------+        | |                    |
|   |  | NGINX |        | |                    |
|   |  +-------+        | |                    |
| +---------------+     | |                    |
|                       | |                    |
+-----------------------+ +--------------------+
```
At the moment we are trying with the non-blocking mode and, other than slowing down the stop of the promtail container itself, ~~it seems to be ok with the other containers~~ but it's not working anyway.
Is there any viable fix available at the moment?
I'm investigating!
You can even reproduce this by starting any container with the loki logger and an unreachable loki-url:

- with the `local` log driver:

```
docker run --log-driver local --log-opt max-size=10m alpine ping 127.0.0.1
```

- with the `loki` log driver:

```
docker run --log-driver loki --log-opt loki-url="http://172.17.0.1:3100/loki/api/v1/push" alpine ping 127.0.0.1
```

In case 1 you can stop/kill the container; in case 2 you can stop/kill it only after 5 minutes or so.
The docker daemon log is not that useful either:

```
level=warn ts=2020-10-28T11:55:05.178484441Z caller=client.go:288 container_id=eb8c67b975f20837210c638d5f83fa1fa011c183c725af337c1fad9ffb2d3a01 component=client host=172.17.0.1:3100 msg="error sending batch, will retry" status=-1 error="Post \"http://172.17.0.1:3100/loki/api/v1/push\": dial tcp 172.17.0.1:3100: connect: connection refused"
```
I probably figured out why it takes so much time; my suspicion was right, and I think this is probably intended behavior: as we can read from the source code, the message is emitted inside the backoff retry loop.
If we start a container with the backoff options reduced to (almost) the minimum, we can see the container stops (almost) immediately:
```
docker run --log-driver loki --log-opt loki-url="http://0.0.0.0:3100/loki/api/v1/push" --log-opt loki-timeout=1s --log-opt loki-max-backoff=800ms --log-opt loki-retries=2 alpine ping 127.0.0.1
```
(If you want to keep the log file after the container has stopped, add the `--log-opt keep-file=true` parameter.)
As far as my understanding goes, though, if the driver is unable to send the logs within the backoff window, the logs will be lost (so I would take keep-file seriously...).
In my opinion the best thing to do would be to cache the logs locally when the client is unable to send them within the backoff window, and send them later on.
Agree with the backoff logic.
Tested with the fluentd log driver; it looks the same there as well, except fluentd may have a lower default backoff time (so the container stops more quickly). And I see this in the daemon log:

```
dockerd[1476]: time="2020-10-28T17:50:12.580014937+01:00" level=warning msg="Logger didn't exit in time: logs may be truncated"
```
Also, another small improvement could be to add a check that the loki-url is reachable when the container starts, and fail immediately.
Also, the 5-minute time limit comes from the default max-backoff we use: https://github.com/grafana/loki/blob/master/pkg/promtail/client/config.go#L19
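For intuition, here is a rough sketch (not the driver's actual code) of how the retry delays add up. The defaults below are assumptions modeled on promtail-like client settings (500ms minimum backoff, 5-minute cap, 10 retries); the real backoff also applies jitter, so observed stop times can be somewhat shorter:

```python
# Rough sketch of an exponential backoff loop, as used when a batch
# fails to send. min/max/retries values are assumptions, not the
# driver's exact configuration; the real implementation adds jitter.
def total_backoff_seconds(min_backoff=0.5, max_backoff=300.0, retries=10):
    total, current = 0.0, min_backoff
    for _ in range(retries):
        total += current                         # sleep before next retry
        current = min(current * 2, max_backoff)  # double, capped at max
    return total

print(total_backoff_seconds())  # 511.5 seconds, i.e. ~8.5 minutes worst case
```

With these assumed defaults, a single failing batch can block for several minutes, which is consistent with the multi-minute `docker stop` times reported in this thread.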
> Also, another small improvement could be to add a check that the loki-url is reachable when the container starts, and fail immediately.
I disagree, as starting a service may be more important than having its logs (and debugging may not be that easy). I would rather use a feature flag, disabled by default.
As I said, in my opinion the best option would be to cache the logs and send them as soon as a Loki endpoint becomes available; in the meantime, find a way to warn the user about the unreachable endpoint and cache the logs.
Agree that a better way of maintaining control over a Docker container when the endpoint is unavailable is critical. I've been experimenting with different Loki architecture deployments and found that even a kill of the Docker container doesn't work. Not being able to shut down/restart a container because the Loki driver can't send logs shouldn't impact my container. I will look at changing my container property defaults to work around this.
Maybe we should accept the behaviour of the docker driver plugin and send the log files to a local "kind of daemonset" promtail, which supports the Loki push API?
https://grafana.com/docs/loki/latest/clients/promtail/#loki-push-api
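A minimal sketch of that idea (the port, URL, and labels are illustrative assumptions, not a tested setup): run promtail locally with the Loki push API enabled, and point the docker driver's loki-url at localhost so the driver's endpoint is always reachable even when the central Loki is down:

```yaml
# Hypothetical promtail fragment: expose the Loki push API locally so the
# docker driver pushes to promtail instead of the remote Loki server.
clients:
  - url: http://loki.example.com:3100/loki/api/v1/push

scrape_configs:
  - job_name: push
    loki_push_api:
      server:
        http_listen_port: 3101
      labels:
        pushed_via: promtail
```

The docker driver's loki-url would then be something like `http://localhost:3101/loki/api/v1/push`, and promtail buffers/retries toward the real Loki.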
The problem is still present. Preparation: uninstalled the driver and installed the newest one per the instructions at https://grafana.com/docs/loki/latest/clients/docker-driver/
Test:
First, a clean test. I run a container with the default log driver or --log-driver=none:

```
docker run --name debian-loki-test --rm debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

Stopping is very fast:

```
time docker stop debian-loki-test
real 0m10,391s
```
But when I use the loki driver with a fake Loki address:

```
docker run --name debian-loki-test --rm --log-driver=loki --log-opt loki-url="http://loki.fake/loki/api/v1/push" --log-opt mode=non-blocking debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

stopping takes far too long:

```
time docker stop debian-loki-test
real 12m9,488s
```

But the app (seq 0 1000000) stops early, almost at once.
When the Loki address is right:

```
docker run --name debian-loki-test --rm --log-driver=loki --log-opt loki-url="http://loki.lan/loki/api/v1/push" --log-opt mode=non-blocking debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

stopping is faster:

```
time docker stop debian-loki-test
real 0m18.450s
```
And starting the container while Loki was up, then taking Loki down and bringing it back online after a few minutes:

```
time docker stop debian-loki-test
real 3m58.299s
```

Once Loki came back online, stopping was immediate.
And I tested the fix: `docker plugin install kavirajk/loki-docker-driver:latest --alias loki-fix`; it does not work either.
Any idea on when this will be released? Or is there a workaround so we can use loki with docker in production?
We're also experiencing this problem now, and the PR that closed this issue hasn't been part of any changelog yet so I'm guessing it hasn't been released?
The fix was already merged and released (it should be available since > 2.0.1). The fix should be available in grafana/loki-docker-driver. Checking if anything changed recently.
I'm by no means a contributor to this project, but it seems the changes added in https://github.com/grafana/loki/pull/2898 are included in neither 2.1.0 nor 2.0.1; or at least there's still `l.client.Stop()` instead of `l.client.StopNow()` at https://github.com/grafana/loki/blob/v2.1.0/cmd/docker-driver/loki.go#L73. If there is a way one can help with getting this into a released version, I'm more than happy to help.
@CptDaniel you are right. The fix is not included in any release yet.
But anyone using grafana/loki-docker-driver:latest (which is always in sync with latest master) should have the fix.
I'm investigating why it doesn't work on grafana/loki-docker-driver:latest. I will keep you posted.
Some findings.
The docker-driver (with the fix) seems to work (at least in some cases).
With grafana/loki-docker-driver:2.1.0 (without the fix), the container takes long to stop:
```
-bash5.0$ docker plugin install grafana/loki-docker-driver:2.1.0 --alias loki-2.1.0
-bash5.0$ docker run --name alpine-loki-test --rm --log-driver=loki-2.1.0 --log-opt mode=non-blocking --log-opt loki-url="http://loki.fake/loki/api/v1/push" alpine ping 127.0.0.1
-bash5.0$ docker ps
CONTAINER ID   IMAGE    COMMAND            CREATED         STATUS         PORTS   NAMES
59ee91811f72   alpine   "ping 127.0.0.1"   3 seconds ago   Up 2 seconds           alpine-loki-test
-bash5.0$ time docker kill 59ee91811f72
59ee91811f72

real    6m5.343s
user    0m0.049s
sys     0m0.051s
```
With grafana/loki-docker-driver:latest (with the fix), the container stops immediately:
```
-bash5.0$ docker plugin install grafana/loki-docker-driver:latest --alias loki
-bash5.0$ docker run --name alpine-loki-test --rm --log-driver=loki --log-opt mode=non-blocking --log-opt loki-url="http://loki.fake/loki/api/v1/push" alpine ping 127.0.0.1
-bash5.0$ docker ps
CONTAINER ID   IMAGE    COMMAND            CREATED         STATUS         PORTS   NAMES
9c396baa78d8   alpine   "ping 127.0.0.1"   3 seconds ago   Up 2 seconds           alpine-loki-test
-bash5.0$ time docker kill 9c396baa78d8
9c396baa78d8

real    0m1.298s
user    0m0.036s
sys     0m0.026s
```
But having said all this, when you run this process:

```
/bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

the fix doesn't work. I am currently investigating to understand this behaviour.
Hi @kavirajk, is there any new information regarding this weird behaviour not covered by the fix?
Hello, I'm experiencing a similar issue with the Loki driver.
Containers become unresponsive and cannot be stopped. The only thing that helps revive the system is restarting the Docker daemon itself via `systemctl restart docker`.
Maybe you'll find these logs useful:
```
Apr 25 10:20:42 united-dance dockerd[339888]: time="2021-04-25T10:20:42Z" level=error msg="main.(*loki).Log(0xc000295f80, 0xc000441030, 0xc000092000, 0x0)" plugin=028a661f9c18e7ca1dbad2ba43459f6217f912ae65e6a3984ade60fcaca36aff
Apr 25 10:20:42 united-dance dockerd[339888]: time="2021-04-25T10:20:42Z" level=error msg="\t/src/loki/cmd/docker-driver/loki.go:69 +0x2fb" plugin=028a661f9c18e7ca1dbad2ba43459f6217f912ae65e6a3984ade60fcaca36aff
Apr 25 10:20:42 united-dance dockerd[339888]: time="2021-04-25T10:20:42Z" level=error msg="\t/src/loki/cmd/docker-driver/driver.go:165 +0x4c2" plugin=028a661f9c18e7ca1dbad2ba43459f6217f912ae65e6a3984ade60fcaca36aff
Apr 25 10:20:42 united-dance dockerd[339888]: time="2021-04-25T10:20:42Z" level=error msg="\t/src/loki/cmd/docker-driver/driver.go:116 +0xa75" plugin=028a661f9c18e7ca1dbad2ba43459f6217f912ae65e6a3984ade60fcaca36aff
```
I am now using vector.dev as a log collector on Docker hosts. It collects the logs through the Docker API, needs no driver, and does not require changing the configuration of the containers.
The following configuration sends the logs as JSON to Loki.
/etc/vector/vector.toml:
```toml
[sources.docker-local]
type = "docker_logs"
docker_host = "/var/run/docker.sock"
exclude_containers = []
# Identify zero-width space as first line of a multiline block.
multiline.condition_pattern = '^\x{200B}' # required
multiline.mode = "halt_before"            # required
multiline.start_pattern = '^\x{200B}'     # required
multiline.timeout_ms = 1000               # required, milliseconds

[sinks.loki]
# General
type = "loki"                             # required
inputs = ["docker*"]                      # required
endpoint = "https://loki.example.com:443" # required
# Auth
auth.strategy = "basic"                   # required
auth.user = "username"                    # required
auth.password = "asdfasdf"                # required
# Encoding
encoding.codec = "json"                   # required
# Healthcheck
healthcheck.enabled = true                # optional, default
# Loki Labels
labels.forwarder = 'vector'
labels.host = '{{ host }}'
labels.container_name = '{{ container_name }}'
labels.compose_service = '{{ label.com\.docker\.compose\.service }}'
labels.compose_project = '{{ label.com\.docker\.compose\.project }}'
labels.source = '{{ stream }}'
labels.category = 'dockerlogs'
```
Example Loki query (from a Docker Logs dashboard) for the output:

```
{host=~"$host",category="dockerlogs",compose_project=~"$project",compose_service=~"$service"} |~ "$search"
| json | line_format "{{.container_name}} {{ .source }} {{.message}}"
```
NOTE: vector currently only reads live logs; it does not collect past logs and does not support checkpointing, so you might miss some logs when vector is stopped/restarted or started later than your services. See https://github.com/timberio/vector/issues/7358
Any updates on this?
Still an issue for us
Any updates?
I investigated the issue in #4082. It also relates to https://github.com/moby/moby/issues/42705.
@Pandry is correct. What happens is the following: the Docker daemon locks when it tries to write the logs of a container, and if the container is killed, it waits until the logs have been drained. The Loki driver receives the logs and passes them on; if the Loki server is unreachable, it retries each batch. This can take a very long time, and the Docker daemon appears deadlocked.
This situation is even worse when you set `max_retries=0`, since that retries forever. Loki uses Cortex's backoff logic.
Unfortunately, one cannot simply restart the Loki driver as the Docker daemon will not reconnect.
I'm not sure what the expected behavior of the driver would be, other than losing logs.
However, we do have a setup where promtail runs on each node and scrapes the Docker containers' file logs. Let me dig up the configuration.
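A sketch of such a setup (the paths, URL, and labels are assumptions for illustration, not the actual configuration mentioned above): promtail tails Docker's json-file logs on each node, so containers keep the default json-file driver and stopping them never depends on Loki being reachable:

```yaml
# Hypothetical promtail config: scrape Docker's json-file logs directly.
# Containers use the default json-file log driver; no loki driver involved.
positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki.example.com:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - json:
          expressions:
            output: log
            stream: stream
      - output:
          source: output
```

Since promtail owns the retry/backoff toward Loki, an unreachable Loki only delays log shipping, not container lifecycle operations.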