docker-logging-plugin
Splunk driver not getting a response from splunk makes docker unresponsive
What happened: We have a cluster of nodes running docker and managed by Marathon/Mesos. The containers running there use the docker splunk logging plugin to send logs to the splunk event collector.
The load balancer in front of the splunk event collector was having trouble connecting, so from the point of view of the logging plugin the HTTPS connections were being opened but never answered; all connections were "hanging". This made the whole environment unstable, as containers were failing healthchecks and unable to serve the applications running on them.
An example of the logs seen in docker:
Aug 12 12:50:34 dockerhost.local dockerd[10030]: time="2019-08-12T12:50:34.493818095-07:00" level=warning msg="Error while sending logs" error="Post https://splunk-ec:443/services/collector/event/1.0: context deadline exceeded" module=logger/splunk
A manual connection to splunk-ec shows that it hangs after sending the request headers and gets no response at all:
$ curl -vk https://splunk-ec:443/services/collector/event/1.0
* About to connect() to splunk-ec port 443 (#0)
* Trying 10.0.0.1...
* Connected to splunk-ec (10.0.0.1) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_RSA_WITH_AES_256_CBC_SHA
* Server certificate:
* subject: CN=<REDACTED>
* start date: Jan 22 16:45:30 2010 GMT
* expire date: Jan 23 01:36:42 2020 GMT
* common name: <REDACTED>
* issuer: CN=Entrust Certification Authority - L1C,OU="(c) 2009 Entrust, Inc.",OU=www.entrust.net/rpa is incorporated by reference,O="Entrust, Inc.",C=US
> GET /services/collector/event/1.0 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: splunk-ec
> Accept: */*
>
^C
What you expected to happen: If the splunk logging driver can't send logs for any reason, it should fill its buffer and drop logs once the buffer is full, rather than making the docker daemon unstable and the application inaccessible.
How to reproduce it (as minimally and precisely as possible):
Have a small app (maybe just `nc -l -p 443`) listen for HTTPS connections but never send any reply, successful or unsuccessful, then point the splunk logging plugin at it.
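For a self-contained reproduction, here is a minimal sketch in Go of such a never-replying listener; the file name and port are illustrative, and `nc` works just as well:

```go
// hangserver.go: accepts TCP connections on the HTTPS port and never
// responds, so any TLS handshake or HTTP request against it hangs.
package main

import (
	"log"
	"net"
)

func main() {
	// Binding :443 needs root; use e.g. :8443 and adjust splunk-url otherwise.
	ln, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("accepting connections and never replying")
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		// Hold the connection open without ever reading or writing.
		go func(net.Conn) { select {} }(conn)
	}
}
```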
Anything else we need to know?: The docker agent runs with these environment variables:
SPLUNK_LOGGING_DRIVER_BUFFER_MAX=400
SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE=200
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE=20
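For reference, on a systemd-based host such variables are typically passed to dockerd through a unit drop-in; a sketch, where the drop-in path is an assumption:

```
# /etc/systemd/system/docker.service.d/splunk-logging.conf (hypothetical path)
[Service]
Environment="SPLUNK_LOGGING_DRIVER_BUFFER_MAX=400"
Environment="SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE=200"
Environment="SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE=20"
```

followed by a systemctl daemon-reload and a docker restart.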
The containers are running with these options (assembled into a full docker run command after the list):
--log-driver=splunk
--log-opt=splunk-token=<token>
--log-opt=splunk-url=https://splunk-ec:443
--log-opt=splunk-index=app
--log-opt=splunk-sourcetype=<sourcetype>
--log-opt=splunk-insecureskipverify=true
--log-opt=env=APP_NAME,HOST,ACTIVE_VERSION
--log-opt=splunk-format=raw
--log-opt=splunk-verify-connection=false
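For reference, assembled into a single command; all flags are taken from the list above, and the image name is a placeholder:

```
docker run -d \
  --log-driver=splunk \
  --log-opt=splunk-token=<token> \
  --log-opt=splunk-url=https://splunk-ec:443 \
  --log-opt=splunk-index=app \
  --log-opt=splunk-sourcetype=<sourcetype> \
  --log-opt=splunk-insecureskipverify=true \
  --log-opt=env=APP_NAME,HOST,ACTIVE_VERSION \
  --log-opt=splunk-format=raw \
  --log-opt=splunk-verify-connection=false \
  <app-image>
```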
Environment:
- Docker version (use `docker version`):
Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 03:47:25 2019
OS/Arch: linux/amd64
Experimental: false
- OS (e.g. `cat /etc/os-release`):
CentOS Linux release 7.6.1810 (Core)
Linux hostname 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Splunk version: 7.1.6 (this shouldn't matter, as the problem was the plugin not receiving an HTTPS response from the load balancer)
- Others:
Thank you for reporting the issue, @Carles-Figuerola. We are aware that the unavailability of Splunk can have really negative consequences on container health.
This will definitely be something we address and fix; however, we may make the behavior an explicit selection. Ideally we need a solution which doesn't bork the container and doesn't drop logs.
Making it an explicit selection is a totally acceptable and good solution. Thanks!
@dtregonning Thank you for responding to the report so quickly. Unfortunately this also affects my application. Are there any updates on this? Alternatively, is there a recommended workaround?
Any updates on this?
Any workarounds, such as tweaking timeout/retry/abandon settings? Is there any way to detect the issue, aside from monitoring the docker daemon log stream?
We are also having the same problem. Any updates on this?
As a workaround, try to enable non-blocking log delivery mode:
https://docs.docker.com/config/containers/logging/configure/#configure-the-delivery-mode-of-log-messages-from-container-to-log-driver
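Per that page, the mode can also be set per container rather than daemon-wide; a sketch, with an illustrative buffer size:

```
docker run --log-driver=splunk \
  --log-opt=mode=non-blocking \
  --log-opt=max-buffer-size=4m \
  ...
```

In non-blocking mode the container writes into an in-memory ring buffer and keeps running when the driver stalls, at the cost of dropping the oldest messages once the buffer fills.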
bump
Same problem here with docker's built-in splunk log driver. The problem starts with connections in CLOSE_WAIT to the heavy forwarder and then propagates to the docker containers. I was hoping to find a solution here, but apparently the same problem applies when using the plugin.
Can anyone here with this issue confirm that setting the mode to non-blocking at least mitigates it? I.e., does this work in the docker daemon.json with the splunk logging driver:
"log-opts": {
"mode": "non-blocking",
"max-buffer-size": "4m"
}
On a side note, we also want to ensure that we never compromise the availability of production containers or the docker API, even if the logging system (still very important) goes down. Ideally, there would be some local temporary filesystem buffer (up to some sane limit, say ~100MB, possibly depending upon how noisy and how many containers you run) which would allow queued-up delivery of logs when/if the splunk endpoint eventually comes back up (a rough sketch follows). This would make temporary splunk endpoint outages survivable without noticeable impact to container functionality or logging, while still preserving container functionality (at the cost of lost logs) during extended outages.
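To make that idea concrete, here is a minimal sketch in Go of such a bounded spill-to-disk buffer. The package, names, line-oriented format, and at-least-once replay semantics are all assumptions; nothing like this exists in the plugin today, and a real implementation would need locking for concurrent use:

```go
// Package spill sketches a bounded on-disk log buffer: queue lines to a
// capped file while the endpoint is down, drop once the cap is hit, and
// replay the backlog when the endpoint recovers.
package spill

import (
	"bufio"
	"os"
)

type Buffer struct {
	f    *os.File
	size int64
	max  int64 // cap in bytes, e.g. 100 * 1024 * 1024 for ~100MB
}

func New(path string, max int64) (*Buffer, error) {
	// O_APPEND keeps writes at the end even while Replay reads from the front.
	f, err := os.OpenFile(path, os.O_RDWR|os.O_APPEND|os.O_CREATE|os.O_TRUNC, 0600)
	if err != nil {
		return nil, err
	}
	return &Buffer{f: f, max: max}, nil
}

// Enqueue appends one log line, returning false (caller drops the line)
// once the cap would be exceeded, keeping disk usage bounded.
func (b *Buffer) Enqueue(line []byte) (bool, error) {
	if b.size+int64(len(line))+1 > b.max {
		return false, nil
	}
	n, err := b.f.Write(append(line, '\n'))
	b.size += int64(n)
	return err == nil, err
}

// Replay streams the backlog through send (e.g. an HEC POST) and truncates
// the file only after everything was delivered; a mid-replay failure keeps
// the backlog and re-sends from the start next time (at-least-once).
func (b *Buffer) Replay(send func([]byte) error) error {
	if _, err := b.f.Seek(0, 0); err != nil {
		return err
	}
	sc := bufio.NewScanner(b.f)
	for sc.Scan() {
		if err := send(sc.Bytes()); err != nil {
			return err // endpoint still down; keep the backlog
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	b.size = 0
	return b.f.Truncate(0)
}
```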
> Can anyone here with this issue confirm that setting the mode to non-blocking at least mitigates it? I.e., does this work in the docker daemon.json with the splunk logging driver:
> "log-opts": { "mode": "non-blocking", "max-buffer-size": "4m" }
Since we enabled non-blocking mode we haven't seen any issues.
Hello, I am currently experiencing the same issues as above.
I have already set non-blocking mode for docker, and I am using the default Splunk globals as per below.
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_FREQUENCY  | 5s
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE | 1000
SPLUNK_LOGGING_DRIVER_BUFFER_MAX               | 10 * 1000
SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE             | 4 * 1000
Are there any recent updates to the Splunk driver that might fix this?
Would reducing the buffer_max and channel_size values help me in any way? People mentioned that setting non-blocking mode worked for them; I wonder if something else helped in conjunction with that.
P.S - A timeout would really help here :)