‘Connection reset by peer’ issues happen when logs are sent to td-agent containers
### Describe the bug
Hi Team,

The following is the logging flow in our system:

application (Fluency) -> cloud load balancer -> td-agent pods deployed in a Kubernetes cluster

Fluency is an application library used to send logs to td-agent: https://github.com/komamitsu/fluency
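For reference, here is a minimal sketch of how a Fluency sender is typically set up (the endpoint, tag, and builder options below are illustrative placeholders, not our exact production code):

```java
import java.util.HashMap;
import java.util.Map;

import org.komamitsu.fluency.Fluency;
import org.komamitsu.fluency.fluentd.FluencyBuilderForFluentd;

public class FluencySetupSketch {
    public static void main(String[] args) throws Exception {
        FluencyBuilderForFluentd builder = new FluencyBuilderForFluentd();
        // Matches maxRetryCount=7 visible in the error log below.
        builder.setSenderMaxRetryCount(7);
        // Assumption: with ack mode on, the forward protocol waits for an ack
        // per chunk, so a silently dropped connection surfaces as a retry
        // instead of silently lost events.
        builder.setAckResponseMode(true);

        // Placeholder endpoint; in our setup this is the cloud load balancer
        // in front of the td-agent pods.
        Fluency fluency = builder.build("log-aggregator.example.com", 24224);
        try {
            Map<String, Object> event = new HashMap<>();
            event.put("message", "hello");
            fluency.emit("myservice.test", event);
        } finally {
            fluency.close(); // flushes remaining buffered events
        }
    }
}
```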
fluentd version:

```
sh-4.2# td-agent --version
td-agent 1.10.2
```
Periodically, we see `Connection reset by peer` error logs. Could you help us check whether the td-agent pods are resetting the connections?

Best regards
### To Reproduce

Please refer to the error logs below.
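If it helps, here is a minimal probe sketch for one suspected cause (this assumes the reset comes from an idle timeout somewhere on the path, e.g. the cloud load balancer dropping idle TCP connections; the host, port, and idle duration are placeholders):

```java
import java.io.OutputStream;
import java.net.Socket;

public class IdleResetProbe {
    public static void main(String[] args) throws Exception {
        String host = "log-aggregator.example.com"; // placeholder LB endpoint
        int port = 24224;
        long idleMillis = 5 * 60 * 1000L; // idle longer than the suspected idle timeout

        try (Socket socket = new Socket(host, port)) {
            System.out.println("connected, idling for " + idleMillis + " ms ...");
            Thread.sleep(idleMillis);

            // If an intermediary silently dropped the connection, the first
            // write may still appear to succeed; a following write then fails
            // with "Connection reset by peer".
            OutputStream out = socket.getOutputStream();
            out.write(0x90); // a 0-element msgpack array; in_forward may warn about it and ignore it
            out.flush();
            out.write(0x90);
            out.flush();
            System.out.println("no reset observed after idling");
        }
    }
}
```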
### Expected behavior

No `Connection reset by peer` issues happen.
### Your Environment

- Fluentd version:
- TD Agent version: td-agent 1.10.2
- Operating system: Linux
- Kernel version:
### Your Configuration

```
<system>
  workers "#{ENV['WORKERS']}"
  log_level trace
</system>

<source>
  @type forward
  @id input1
  port 24224
</source>

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

<label @FLUENT_LOG>
  <match fluent.**>
    @type stdout
  </match>
</label>

<match "#{ENV['SERVICE']}.**">
  @type copy
  <store>
    @type forward
    expire_dns_cache 60
    dns_round_robin true
    heartbeat_type transport
    <server>
      name log-aggregator
      host "#{ENV['AGGREGATOR']}"
      port "#{ENV['AGGREGATOR_PORT']}"
      weight 60
    </server>
    <buffer>
      @type file
      path "/var/log/fluent/#{ENV['SERVICE']}"
      retry_max_times 50
      flush_interval 10s
      flush_at_shutdown true
    </buffer>
  </store>
</match>
```
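Since monitor_agent is enabled on port 24220 in the config above, per-plugin metrics (retry and buffer counters) can be pulled while the issue is happening. A minimal sketch, assuming it runs from inside the pod (or via a port-forward) so localhost:24220 is reachable:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MonitorAgentCheck {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:24220/api/plugins.json"))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Dumps each plugin's state (retry_count, buffer queue length, ...) as JSON.
        System.out.println(response.body());
    }
}
```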
### Your Error Log

The following are the application-side Fluency logs when the `Connection reset by peer` issue happens:
```
56007051 WARN [pool-1-thread-1] org.komamitsu.fluency.fluentd.ingester.sender.RetryableSender - Sender failed to send data. sender=RetryableSender{baseSender=TCPSender{config=Config{host='xxx', port=24224, connectionTimeoutMilli=5000, readTimeoutMilli=5000, waitBeforeCloseMilli=1000} Config{senderErrorHandler=null}} NetworkSender{config=Config{host='siv-admin-cluster-log.perf.fastretailing.cn', port=24224, connectionTimeoutMilli=5000, readTimeoutMilli=5000, waitBeforeCloseMilli=1000} Config{senderErrorHandler=null}, failureDetector=null} org.komamitsu.fluency.fluentd.ingester.sender.TCPSender@52e919e6, retryStrategy=ExponentialBackOffRetryStrategy{config=Config{baseIntervalMillis=400, maxIntervalMillis=30000} Config{maxRetryCount=7}} RetryStrategy{config=Config{baseIntervalMillis=400, maxIntervalMillis=30000} Config{maxRetryCount=7}}, isClosed=false} org.komamitsu.fluency.fluentd.ingester.sender.RetryableSender@116c37a0, retry=0
java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:182)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:130)
at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:493)
at java.base/java.nio.channels.SocketChannel.write(SocketChannel.java:507)
at org.komamitsu.fluency.fluentd.ingester.sender.TCPSender.sendBuffers(TCPSender.java:86)
at org.komamitsu.fluency.fluentd.ingester.sender.TCPSender.sendBuffers(TCPSender.java:31)
at org.komamitsu.fluency.fluentd.ingester.sender.NetworkSender.sendInternal(NetworkSender.java:102)
at org.komamitsu.fluency.fluentd.ingester.sender.FluentdSender.sendInternalWithRestoreBufferPositions(FluentdSender.java:74)
at org.komamitsu.fluency.fluentd.ingester.sender.FluentdSender.send(FluentdSender.java:56)
at org.komamitsu.fluency.fluentd.ingester.sender.RetryableSender.sendInternal(RetryableSender.java:77)
at org.komamitsu.fluency.fluentd.ingester.sender.FluentdSender.sendInternalWithRestoreBufferPositions(FluentdSender.java:74)
at org.komamitsu.fluency.fluentd.ingester.sender.FluentdSender.send(FluentdSender.java:56)
at org.komamitsu.fluency.fluentd.ingester.FluentdIngester.ingest(FluentdIngester.java:87)
at org.komamitsu.fluency.buffer.Buffer.flushInternal(Buffer.java:357)
at org.komamitsu.fluency.buffer.Buffer.flush(Buffer.java:112)
at org.komamitsu.fluency.flusher.Flusher.runLoop(Flusher.java:66)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
```
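For context, the RetryableSender config in the log above (baseIntervalMillis=400, maxIntervalMillis=30000, maxRetryCount=7) means each failed flush is retried with exponential backoff before Fluency gives up. A sketch of the resulting wait times, assuming the usual doubling scheme interval = min(base * 2^retry, max) (the exact formula inside Fluency may differ):

```java
public class BackoffIntervals {
    public static void main(String[] args) {
        long baseMillis = 400;   // baseIntervalMillis from the log
        long maxMillis = 30000;  // maxIntervalMillis from the log
        int maxRetryCount = 7;   // maxRetryCount from the log

        for (int retry = 0; retry <= maxRetryCount; retry++) {
            long interval = Math.min(baseMillis << retry, maxMillis);
            System.out.printf("retry=%d wait=%dms%n", retry, interval);
        }
        // retry=0 waits 400ms, retry=1 waits 800ms, ... capped at 30000ms.
    }
}
```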
Meanwhile, the logs in the td-agent pods look like the following when the `Connection reset by peer` issue happens:
```
2022-06-20 02:25:16 +0000 [debug]: #1 Created new chunk chunk_id="5e1d7d087b144adec5bd0ede54a4bb85" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:16.325249836 +0000 fluent.debug: {"chunk_id":"5e1d7d087b144adec5bd0ede54a4bb85","metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"Created new chunk chunk_id=\"5e1d7d087b144adec5bd0ede54a4bb85\" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
2022-06-20 02:25:29 +0000 [trace]: #1 enqueueing chunk instance=25625960 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:29.291859732 +0000 fluent.trace: {"instance":25625960,"metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"enqueueing chunk instance=25625960 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
2022-06-20 02:25:29 +0000 [trace]: #1 chunk dequeued instance=25625960 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:29.506134644 +0000 fluent.trace: {"instance":25625960,"metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"chunk dequeued instance=25625960 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
2022-06-20 02:25:29 +0000 [trace]: #1 purging a chunk instance=25625960 chunk_id="5e1d7d087b144adec5bd0ede54a4bb85" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:29.508864391 +0000 fluent.trace: {"instance":25625960,"chunk_id":"5e1d7d087b144adec5bd0ede54a4bb85","metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"purging a chunk instance=25625960 chunk_id=\"5e1d7d087b144adec5bd0ede54a4bb85\" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
2022-06-20 02:25:29 +0000 [trace]: #1 chunk purged instance=25625960 chunk_id="5e1d7d087b144adec5bd0ede54a4bb85" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:29.509033136 +0000 fluent.trace: {"instance":25625960,"chunk_id":"5e1d7d087b144adec5bd0ede54a4bb85","metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"chunk purged instance=25625960 chunk_id=\"5e1d7d087b144adec5bd0ede54a4bb85\" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
```
### Additional context
_No response_