[BUG] go-carbon exhausting all file FDs with connections to Kafka
Describe the bug We have discovered that go-carbon continually runs out of file descriptors, which causes the log to bloat and fill the filesystem. Similar to what was seen in #374 . All of the FDs being held open are TCP sockets and connections to the Kafka servers that data is read from. All systems are running go-carbon-0.15.6, which, according to the previously mentioned report of what appears to be a similar issue, should already contain the fix.
Logs: "too many open files" entries start appearing on the system.
[2021-05-07T03:04:10.056-0500] WARN [tcp] failed to accept connection {"error": "accept tcp 0.0.0.0:2003: accept4: too many open files"}
[2021-05-07T03:04:10.056-0500] WARN [tcp] failed to accept connection {"error": "accept tcp 0.0.0.0:2003: accept4: too many open files"}
[2021-05-07T03:04:12.059-0500] WARN [tcp] failed to accept connection {"error": "accept tcp 0.0.0.0:2003: accept4: too many open files"}
[2021-05-07T03:04:12.059-0500] WARN [tcp] failed to accept connection {"error": "accept tcp 0.0.0.0:2003: accept4: too many open files"}
[2021-05-07T03:04:12.059-0500] WARN [tcp] failed to accept connection {"error": "accept tcp 0.0.0.0:2003: accept4: too many open files"}
Looking at lsof for the go-carbon process, there are thousands of sockets open:
go-carbon 9236 carbon *509u sock 0,7 0t0 4143452374 protocol: TCP
go-carbon 9236 carbon *510u sock 0,7 0t0 4143427564 protocol: TCP
go-carbon 9236 carbon *513u sock 0,7 0t0 4143452376 protocol: TCP
go-carbon 9236 carbon *515u sock 0,7 0t0 4143409091 protocol: TCP
go-carbon 9236 carbon *517u sock 0,7 0t0 4143427565 protocol: TCP
go-carbon 9236 carbon *519u sock 0,7 0t0 4143427581 protocol: TCP
go-carbon 9236 carbon *521u sock 0,7 0t0 4143427583 protocol: TCP
go-carbon 9236 carbon *523u sock 0,7 0t0 4143452453 protocol: TCP
go-carbon 9236 carbon *526u sock 0,7 0t0 4143445316 protocol: TCP
go-carbon 9236 carbon *527u sock 0,7 0t0 4143455234 protocol: TCP
go-carbon 9236 carbon *528u sock 0,7 0t0 4143452454 protocol: TCP
[root@graphite1 ~]# lsof -p 9236|grep sock -c
17760
Along with thousands of entries for TCP connections to Kafka:
go-carbon 9236 carbon *605u IPv4 4143714097 0t0 TCP graphite1.x.x.x.x:47778->x-x-x-x.x:copycat (ESTABLISHED)
[root@graphite1 ~]# lsof -p 9236|grep 50-87 -c
17928
Go-carbon Configuration:
[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "udp://x.x.x.x:2003"
metric-interval = "1m0s"
max-cpu = 4
[whisper]
data-dir = "/var/lib/graphite/whisper"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
workers = 8
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = true
enabled = true
hash-filenames = true
compressed = false
remove-empty-file = false
[cache]
max-size = 1000000
write-strategy = "max"
[udp]
listen = ":2003"
enabled = true
buffer-size = 0
[tcp]
listen = ":2003"
enabled = true
buffer-size = 0
[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0
[receiver.kafkapart0]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart0.state"
partition = 0
[receiver.kafkapart1]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart1.state"
partition = 1
[receiver.kafkapart2]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart2.state"
partition = 2
[receiver.kafkapart3]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart3.state"
partition = 3
[receiver.kafkapart4]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart4.state"
partition = 4
[receiver.kafkapart5]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart5.state"
partition = 5
[receiver.kafkapart6]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart6.state"
partition = 6
[receiver.kafkapart7]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart7.state"
partition = 7
[receiver.kafkapart8]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart8.state"
partition = 8
[receiver.kafkapart9]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart9.state"
partition = 9
[receiver.kafkapart10]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart10.state"
partition = 10
[receiver.kafkapart11]
protocol = "kafka"
parse-protocol = "plain"
brokers = [ "kafka1:9093", "kafka2:9093", "kafka3:9093" ]
topic = "user-graphite"
kafka-version = "2.6.0"
state-file = "/var/lib/graphite/kafkapart11.state"
partition = 11
[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"
[grpc]
listen = "127.0.0.1:7003"
enabled = false
[tags]
enabled = false
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
local-dir = "/var/lib/graphite/tagging/"
tagdb-timeout = "1s"
[carbonserver]
listen = "127.0.0.1:8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "60s"
write-timeout = "60s"
query-cache-enabled = true
query-cache-size-mb = 0
find-cache-enabled = true
trigram-index = true
scan-frequency = "5m0s"
trie-index = false
cache-scan = false
max-globs = 100
fail-on-max-globs = false
max-metrics-globbed = 30000
max-metrics-rendered = 1000
graphite-web-10-strict-mode = true
internal-stats-dir = ""
stats-percentiles = [99, 98, 95, 75, 50]
[dump]
enabled = false
path = "/var/lib/graphite/dump/"
restore-per-second = 0
[pprof]
listen = "localhost:7007"
enabled = false
[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
The problem with the kafka listener is that it was never used in production (or at least I'm not aware of it), so it might contain bugs, but I don't have a test environment for it (AFAIR I was the only one who touched that code).
You probably need to raise the open files limit. Having a section per partition is very resource-hungry; 1024 or even 4096 could be too small.
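For reference, the advice above (check whether the process limit is big enough for the number of per-partition receiver sections) can be sketched in Go with `syscall.Getrlimit`. The `minFDsForReceivers` heuristic and its multiplier are illustrative assumptions, not go-carbon values:

```go
package main

import (
	"fmt"
	"syscall"
)

// minFDsForReceivers is a rough, hypothetical heuristic: each
// [receiver.kafkapartN] section keeps broker connections open, on top of
// whisper files, listeners, and caches. The numbers are illustrative only.
func minFDsForReceivers(receivers int) uint64 {
	return uint64(receivers)*64 + 4096
}

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	need := minFDsForReceivers(12) // 12 partitions, as in the config above
	fmt.Printf("open-files limit: soft=%d hard=%d, want at least %d\n", rl.Cur, rl.Max, need)
	if rl.Cur < need {
		fmt.Println("soft limit looks too low; raise it via ulimit -n or systemd LimitNOFILE")
	}
}
```

As the later comments show, though, a higher limit only delays the failure when descriptors are leaked rather than closed.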
(quoting the sock entries from the lsof output above)
This is odd because those socks don't have any dst and src info. https://serverfault.com/questions/1000338/in-lsof-output-what-are-those-sock-lines
@cgwilsonjr are you seeing any error logs in go-carbon.log or go-carbon-access.log, like failed to connect to kafka, etc.?
So @deniszh, raising the open files limit would not have resolved this: it was opening 12 new connections every minute that apparently were never closed properly and never timed out, so they just stayed open. The file limit on this host was already set to 55,000.
@bom-d-van it looks like I missed some errors in the log that might have been the cause of this.
[2021-05-13T07:32:54.183-0500] ERROR [kafkapart0] failed to connect to kafka {"reconnect_interval": 60, "error": "kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition."}
However, I feel like this should be handled more gracefully than loading the process up with thousands of file descriptors until it reaches the max open files set in ulimit, at which point go-carbon spews tons of errors and fills a partition.
What I had to do was wipe out the state files and let them regenerate with the proper offset that the kafka server was on.
Hi @cgwilsonjr, yes, you are right. Glad you sorted it out. Thanks for reporting the issue; I have pushed a fix.
With the changes, the kafka receiver will release/clean up its resources if it fails to consume a partition. And specifically, for the out-of-range errors you were having, it now falls back to the oldest offset.
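The out-of-range fallback described above can be sketched as follows. This is a minimal illustration of the idea, not go-carbon's actual code; the function and its names are hypothetical:

```go
package main

import "fmt"

// resolveOffset picks a safe starting offset: if the offset saved in the
// state file is outside the range the broker still retains (e.g. because
// topic retention moved past it), fall back to the oldest retained offset
// instead of failing the partition consumer and leaking its connections.
func resolveOffset(saved, oldest, newest int64) int64 {
	if saved < oldest || saved > newest {
		return oldest // out of range: start from the oldest retained offset
	}
	return saved
}

func main() {
	// Broker retains offsets [5000, 9000]; the state file says 1200 (too old).
	fmt.Println(resolveOffset(1200, 5000, 9000)) // prints 5000 (fell back)
	fmt.Println(resolveOffset(6000, 5000, 9000)) // prints 6000 (still valid)
}
```

With this behaviour, a stale state file no longer needs to be wiped by hand: the consumer recovers on its own instead of retrying forever and accumulating descriptors.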