
‘Connection reset by peer’ issues happen when logs are sent to td-agent containers

Open frsh-augustin opened this issue 2 years ago • 1 comment

Describe the bug

Hi Team,

The following is the logging flow in our system: application (fluency) -> cloud load balancer -> td-agent pods deployed in a Kubernetes cluster. fluency is an application library used to send logs to td-agent: https://github.com/komamitsu/fluency

fluentd version:

sh-4.2# td-agent --version
td-agent 1.10.2
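
For context, the application builds its fluency sender roughly like the sketch below. This is only a minimal illustration assuming the fluency 2.x builder API and the timeouts visible in the error log further down, not our exact application code, and the builder method names should be checked against the fluency version in use.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.komamitsu.fluency.Fluency;
import org.komamitsu.fluency.fluentd.FluencyBuilderForFluentd;

public class FluencySketch {
    public static void main(String[] args) throws IOException {
        FluencyBuilderForFluentd builder = new FluencyBuilderForFluentd();
        // Timeouts and retry count roughly matching the sender settings in the error log below
        builder.setConnectionTimeoutMilli(5000);
        builder.setReadTimeoutMilli(5000);
        builder.setSenderMaxRetryCount(7);

        // The host here is the cloud load balancer in front of the td-agent pods (placeholder value)
        Fluency fluency = builder.build("load-balancer-host", 24224);

        Map<String, Object> event = new HashMap<>();
        event.put("message", "hello from the application");
        fluency.emit("store-sales.example", event); // illustrative tag matching the SERVICE.** routing

        fluency.close();
    }
}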

Periodically we see 'connection reset by peer' error logs. Can you help us determine whether the td-agent pods are resetting the connections?

Best regards

To Reproduce

Please refer to the error logs below.

Expected behavior

No 'connection reset by peer' errors occur.

Your Environment

- Fluentd version:
- TD Agent version: td-agent 1.10.2
- Operating system: Linux
- Kernel version:

Your Configuration

<system>
  workers "#{ENV['WORKERS']}"
  log_level trace
</system>

<source>
  @type  forward
  @id    input1
  port   24224
</source>

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

<label @FLUENT_LOG>
  <match fluent.**>
    @type stdout
  </match>
</label>

<match "#{ENV['SERVICE']}.**">
  @type copy
  <store>
    @type forward
    expire_dns_cache 60
    dns_round_robin true
    heartbeat_type transport
    <server>
      name log-aggregator
      host "#{ENV['AGGREGATOR']}"
      port "#{ENV['AGGREGATOR_PORT']}"
      weight 60
    </server>
    <buffer>
      @type file
      path "/var/log/fluent/#{ENV['SERVICE']}"
      retry_max_times 50
      flush_interval 10s
      flush_at_shutdown true
    </buffer>
  </store>
</match>
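
For reference, the forward input above runs with the default socket options; an illustrative variant with the socket-related in_forward parameter spelled out (not something we currently set) would look like this:

<source>
  @type  forward
  @id    input1
  port   24224
  # illustrative only, not in our current config:
  # linger_timeout defaults to 0; a non-zero value avoids closing sockets with an immediate RST
  linger_timeout 5
</source>

As far as we understand, closing a socket with a zero linger timeout can surface on the client as 'connection reset by peer', and the cloud load balancer in front of the pods could also be dropping idle connections; we are not sure which, if either, applies here.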

Your Error Log

The following are the application (fluency) logs when the connection reset by peer issue happens:

56007051 WARN [pool-1-thread-1] org.komamitsu.fluency.fluentd.ingester.sender.RetryableSender - Sender failed to send data. sender=RetryableSender{baseSender=TCPSender{config=Config{host='xxx', port=24224, connectionTimeoutMilli=5000, readTimeoutMilli=5000, waitBeforeCloseMilli=1000} Config{senderErrorHandler=null}} NetworkSender{config=Config{host='siv-admin-cluster-log.perf.fastretailing.cn', port=24224, connectionTimeoutMilli=5000, readTimeoutMilli=5000, waitBeforeCloseMilli=1000} Config{senderErrorHandler=null}, failureDetector=null} org.komamitsu.fluency.fluentd.ingester.sender.TCPSender@52e919e6, retryStrategy=ExponentialBackOffRetryStrategy{config=Config{baseIntervalMillis=400, maxIntervalMillis=30000} Config{maxRetryCount=7}} RetryStrategy{config=Config{baseIntervalMillis=400, maxIntervalMillis=30000} Config{maxRetryCount=7}}, isClosed=false} org.komamitsu.fluency.fluentd.ingester.sender.RetryableSender@116c37a0, retry=0
java.io.IOException: Connection reset by peer
        at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
        at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:182)
        at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:130)
        at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:493)
        at java.base/java.nio.channels.SocketChannel.write(SocketChannel.java:507)
        at org.komamitsu.fluency.fluentd.ingester.sender.TCPSender.sendBuffers(TCPSender.java:86)
        at org.komamitsu.fluency.fluentd.ingester.sender.TCPSender.sendBuffers(TCPSender.java:31)
        at org.komamitsu.fluency.fluentd.ingester.sender.NetworkSender.sendInternal(NetworkSender.java:102)
        at org.komamitsu.fluency.fluentd.ingester.sender.FluentdSender.sendInternalWithRestoreBufferPositions(FluentdSender.java:74)
        at org.komamitsu.fluency.fluentd.ingester.sender.FluentdSender.send(FluentdSender.java:56)
        at org.komamitsu.fluency.fluentd.ingester.sender.RetryableSender.sendInternal(RetryableSender.java:77)
        at org.komamitsu.fluency.fluentd.ingester.sender.FluentdSender.sendInternalWithRestoreBufferPositions(FluentdSender.java:74)
        at org.komamitsu.fluency.fluentd.ingester.sender.FluentdSender.send(FluentdSender.java:56)
        at org.komamitsu.fluency.fluentd.ingester.FluentdIngester.ingest(FluentdIngester.java:87)
        at org.komamitsu.fluency.buffer.Buffer.flushInternal(Buffer.java:357)
        at org.komamitsu.fluency.buffer.Buffer.flush(Buffer.java:112)
        at org.komamitsu.fluency.flusher.Flusher.runLoop(Flusher.java:66)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

Meanwhile, the logs in the td-agent pods look like the following when the connection reset by peer issue happens:

2022-06-20 02:25:16 +0000 [debug]: #1 Created new chunk chunk_id="5e1d7d087b144adec5bd0ede54a4bb85" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:16.325249836 +0000 fluent.debug: {"chunk_id":"5e1d7d087b144adec5bd0ede54a4bb85","metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"Created new chunk chunk_id=\"5e1d7d087b144adec5bd0ede54a4bb85\" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
2022-06-20 02:25:29 +0000 [trace]: #1 enqueueing chunk instance=25625960 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:29.291859732 +0000 fluent.trace: {"instance":25625960,"metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"enqueueing chunk instance=25625960 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
2022-06-20 02:25:29 +0000 [trace]: #1 chunk dequeued instance=25625960 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:29.506134644 +0000 fluent.trace: {"instance":25625960,"metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"chunk dequeued instance=25625960 metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
2022-06-20 02:25:29 +0000 [trace]: #1 purging a chunk instance=25625960 chunk_id="5e1d7d087b144adec5bd0ede54a4bb85" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:29.508864391 +0000 fluent.trace: {"instance":25625960,"chunk_id":"5e1d7d087b144adec5bd0ede54a4bb85","metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"purging a chunk instance=25625960 chunk_id=\"5e1d7d087b144adec5bd0ede54a4bb85\" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}
2022-06-20 02:25:29 +0000 [trace]: #1 chunk purged instance=25625960 chunk_id="5e1d7d087b144adec5bd0ede54a4bb85" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error", variables=nil, seq=0>
2022-06-20 02:25:29.509033136 +0000 fluent.trace: {"instance":25625960,"chunk_id":"5e1d7d087b144adec5bd0ede54a4bb85","metadata":"#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>","message":"chunk purged instance=25625960 chunk_id=\"5e1d7d087b144adec5bd0ede54a4bb85\" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=\"store-sales.gu.cn.accounting_register_worker.accounting-register-worker-gu-cn-deploy-7d9d759d5f-w68dl.error\", variables=nil, seq=0>"}


Additional context

_No response_

frsh-augustin avatar Jun 20 '22 07:06 frsh-augustin

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

github-actions[bot] avatar Sep 18 '22 10:09 github-actions[bot]