
Splunk driver not getting a response from Splunk makes Docker unresponsive

Open Carles-Figuerola opened this issue 4 years ago • 13 comments

What happened: We have a cluster of nodes running Docker, managed by Marathon/Mesos. The containers running there use the Docker Splunk logging plugin to send logs to the Splunk HTTP Event Collector.

The load balancer in front of the Splunk event collector was having trouble connecting, so from the point of view of the logging plugin the HTTPS connections were being opened but never answered; every connection was left "hanging". This made the whole environment unstable, as containers were failing healthchecks and could not serve the applications running on them.

An example of the log lines seen from dockerd:

Aug 12 12:50:34 dockerhost.local dockerd[10030]: time="2019-08-12T12:50:34.493818095-07:00" level=warning msg="Error while sending logs" error="Post https://splunk-ec:443/services/collector/event/1.0: context deadline exceeded" module=logger/splunk

A manual connection to splunk-ec shows that it hangs after the request headers are sent and never gets a response:

$ curl -vk https://splunk-ec:443/services/collector/event/1.0
* About to connect() to splunk-ec port 443 (#0)
*   Trying 10.0.0.1...
* Connected to splunk-ec (10.0.0.1) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_RSA_WITH_AES_256_CBC_SHA
* Server certificate:
*       subject: CN=<REDACTED>
*       start date: Jan 22 16:45:30 2010 GMT
*       expire date: Jan 23 01:36:42 2020 GMT
*       common name: <REDACTED>
*       issuer: CN=Entrust Certification Authority - L1C,OU="(c) 2009 Entrust, Inc.",OU=www.entrust.net/rpa is incorporated by reference,O="Entrust, Inc.",C=US
> GET /services/collector/event/1.0 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: splunk-ec
> Accept: */*
>
^C

What you expected to happen: If the Splunk logging driver can't send logs for any reason, it should fill its buffer and drop logs once the buffer is full, not make the Docker daemon unstable and the application inaccessible.

How to reproduce it (as minimally and precisely as possible): Have a small app (maybe just nc -l -p443) listen on the HTTPS port but never send any reply, successful or unsuccessful, then point the Splunk logging plugin at it, as sketched below.
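A rough sketch of such a setup (illustrative values only: the localhost listener, throwaway token, and alpine loop are placeholders, not taken from the report; exact nc flags vary by netcat flavor):

# terminal 1: accept connections on 443 but never send a byte back
# (binding 443 needs root; any port works if splunk-url is adjusted)
nc -l -p 443 > /dev/null

# terminal 2: point a noisy container's splunk log driver at the silent listener
docker run --rm \
  --log-driver=splunk \
  --log-opt=splunk-url=https://localhost:443 \
  --log-opt=splunk-token=00000000-0000-0000-0000-000000000000 \
  --log-opt=splunk-insecureskipverify=true \
  --log-opt=splunk-verify-connection=false \
  alpine sh -c 'while true; do date; sleep 1; done'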

Anything else we need to know?: The Docker daemon runs with these environment variables (a sketch of how such settings are typically applied follows the list):

SPLUNK_LOGGING_DRIVER_BUFFER_MAX=400
SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE=200
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE=20
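
These are daemon-level settings, so on a systemd host like this one they would typically be applied through a drop-in for the docker unit; the file path below is a hypothetical example, not taken from the report:

# /etc/systemd/system/docker.service.d/splunk-logging.conf
[Service]
Environment="SPLUNK_LOGGING_DRIVER_BUFFER_MAX=400"
Environment="SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE=200"
Environment="SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE=20"

followed by systemctl daemon-reload && systemctl restart docker.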

The containers are running with these options:

--log-driver=splunk
--log-opt=splunk-token=<token>
--log-opt=splunk-url=https://splunk-ec:443
--log-opt=splunk-index=app
--log-opt=splunk-sourcetype=<sourcetype>
--log-opt=splunk-insecureskipverify=true
--log-opt=env=APP_NAME,HOST,ACTIVE_VERSION
--log-opt=splunk-format=raw
--log-opt=splunk-verify-connection=false

Environment:

  • Docker version (use docker version):
Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:47:25 2019
  OS/Arch:          linux/amd64
  Experimental:     false
  • OS (e.g: cat /etc/os-release):
CentOS Linux release 7.6.1810 (Core)
Linux hostname 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Splunk version:
7.1.6

(this shouldn't matter, as the problem was the logging plugin not getting an HTTPS response from the load balancer in front of Splunk)

  • Others:

Carles-Figuerola avatar Aug 19 '19 16:08 Carles-Figuerola

Thank you for reporting the issue, @Carles-Figuerola. We are aware that the unavailability of Splunk can have really negative consequences for container health.

This is definitely something we will address and fix; however, we may make the behavior an explicit selection. Ideally we need a solution that neither borks the container nor drops logs.

dtregonning avatar Aug 20 '19 17:08 dtregonning

Making it an explicit selection is a totally acceptable and good solution. Thanks!

Carles-Figuerola avatar Aug 20 '19 18:08 Carles-Figuerola

@dtregonning Thank you for responding to the report so quickly. Unfortunately this also affects my application. Are there any updates on this? Alternatively, is there a recommended workaround?

PiotrJustyna avatar Oct 01 '19 19:10 PiotrJustyna

Any updates on this?

gabricar avatar Jul 13 '20 21:07 gabricar

Any workarounds, such as tweaking timeout/retry/abandon settings, etc? Any way to detect the issue existing aside from monitoring docker daemon log stream?

zerog2k avatar Aug 20 '20 14:08 zerog2k

We are also having the same problem. Any updates on this?

fabriciofelipe avatar Sep 23 '20 16:09 fabriciofelipe

As a workaround, try enabling non-blocking log delivery mode:

https://docs.docker.com/config/containers/logging/configure/#configure-the-delivery-mode-of-log-messages-from-container-to-log-driver
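
For example, per container (an illustrative invocation; mode and max-buffer-size are the options described at that link, the values here are just an example):

docker run --rm \
  --log-driver=splunk \
  --log-opt=splunk-url=https://splunk-ec:443 \
  --log-opt=splunk-token=<token> \
  --log-opt=mode=non-blocking \
  --log-opt=max-buffer-size=4m \
  <image>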

ykapustin avatar Oct 17 '20 01:10 ykapustin

bump

johnjelinek avatar Jan 25 '21 19:01 johnjelinek

Same problem here with Docker's built-in splunk log driver. The problem starts with connections to the heavy forwarder stuck in CLOSE_WAIT and then propagates to the Docker containers. I was hoping to find a solution here, but apparently the same problem applies when using the plugin.

joachimbuechse avatar Jun 01 '21 23:06 joachimbuechse

Can anyone here with this issue confirm that setting mode to non-blocking at least mitigates it? i.e. does this work in docker daemon.json with the splunk logging driver:

  "log-opts": {
    "mode": "non-blocking",
    "max-buffer-size": "4m"
 }

On a side note, we also want to ensure that we never compromise the availability of production containers or the Docker API, even if the logging system (still very important) goes down. Ideally, there would be a local temporary filesystem buffer (up to some sane limit, say ~100MB, possibly depending on how noisy and how numerous your containers are) that would allow queued-up delivery of logs when/if the Splunk endpoint eventually comes back up. This would make temporary Splunk endpoint outages survivable without noticeable impact to container functionality or logging, while preserving container functionality (at the cost of lost logs) during extended Splunk endpoint outages.

zerog2k avatar Oct 01 '21 15:10 zerog2k

Can anyone here with this issue confirm that setting mode to non-blocking at least mitigates it? i.e. does this work in docker daemon.json with the splunk logging driver:

  "log-opts": {
    "mode": "non-blocking",
    "max-buffer-size": "4m"
 }

Since we enabled non-blocking mode we haven't seen any issues.

ykapustin avatar Oct 01 '21 21:10 ykapustin

Hello, I am currently experiencing the same issues as above.

I have already set non-blocking mode for Docker and am using the default Splunk globals, as below.

SPLUNK_LOGGING_DRIVER_POST_MESSAGES_FREQUENCY | 5s
SPLUNK_LOGGING_DRIVER_POST_MESSAGES_BATCH_SIZE | 1000
SPLUNK_LOGGING_DRIVER_BUFFER_MAX | 10 * 1000
SPLUNK_LOGGING_DRIVER_CHANNEL_SIZE | 4 * 1000

Are there any current updates to the Splunk driver that might fix this?

Would reducing the buffer_max and channel_size values help me in any way? People mentioned that setting non-blocking mode worked for them; I wonder if something else helped in conjunction with that.

P.S - A timeout would really help here :)

Idriosiris avatar Sep 23 '22 08:09 Idriosiris