Buffer overflow occurs frequently in async mode
We use lua-resty-kafka to send our logs to Kafka. QPS is 6K+, and the size per request is 0.6 KB. However, we see many buffer overflow errors in the error log. And I found the source of this error in ringbuffer.lua:
function _M.add(self, topic, key, message)
    local num = self.num
    local size = self.size

    if num >= size then
        return nil, "buffer overflow"
    end

    -- each message occupies 3 consecutive slots: topic, key, message
    local index = (self.start + num) % size
    local queue = self.queue

    queue[index] = topic
    queue[index + 1] = key
    queue[index + 2] = message

    self.num = num + 3

    return true
end
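For context on the numbers below: each message takes three slots in the queue (topic, key, message), so the internal `self.size` is `max_buffering * 3`, and the buffer holds at most `max_buffering` messages. A minimal sketch of how the overflow is hit, using the names from the snippet above (the constructor arguments shown are an assumption, not necessarily the module's exact signature):

```lua
-- Sketch only: assumes the internal resty.kafka.ringbuffer module
-- can be constructed directly; in practice the producer creates it.
local ringbuffer = require "resty.kafka.ringbuffer"

-- suppose max_buffering = 2 messages -> self.size = 2 * 3 = 6 slots
local buffer = ringbuffer:new(200, 2)

buffer:add("logs", nil, "message 1")            -- ok
buffer:add("logs", nil, "message 2")            -- ok
local ok, err = buffer:add("logs", nil, "m3")   -- nothing has drained the
                                                -- buffer yet, so this fails
-- ok == nil, err == "buffer overflow"
```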
What config should I set? And what does this error mean?
I have the same problem! `opt.max_buffering` is set to the default value (50000), but the library prints 'buffer overflow' when QPS is greater than 50. I debugged the function _M.add: `self.size` is 150000 (i.e. 50000 * 3), which proves the configuration is applied correctly. @doujiang24
@IvyTang @logbird This usually means the network between the producer and the Kafka server is not fast enough. Are the producer and the Kafka server in the same datacenter?
(Translated from Chinese:) Let's switch to Chinese between us. My OpenResty and Kafka services are deployed in the same datacenter, and our current production traffic should not trigger this problem.

While reproducing the issue in a development environment, I found a way to trigger it:

1. Use ab to load-test the OpenResty producer with concurrency 50.
2. Shut down the Kafka service to simulate a Kafka failure.

At this point Kafka cannot accept messages, but the OpenResty buffer count does not grow; error messages are just printed to the error log and the messages are lost. However, if nginx is reloaded at this moment, after the reload the buffer count only grows and never shrinks, until the overflow is triggered.

This reproduction may differ from whatever is triggering the problem in production, because our production Kafka service has been healthy the whole time. So please help take a look. Also, if convenient, I'd like to discuss over QQ: 1027672948 @doujiang24
Any solution to this issue?
I've run into this problem and solved it. Sharing it for others.

The root cause:
- there is an ngx-worker-level lock around flushing the messages: the flushing coroutine takes the lock, pulls data from the ring buffer, fills it into the sending buffer, sends the request to Kafka, then releases the lock
- only one coroutine can hold the lock at a time
- other coroutines cannot get the lock while a Kafka request is still in flight, and will simply quit
- if the coroutine holding the lock blocks (because of Kafka), the ring buffer overflows
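The flush path described above can be sketched roughly like this (a simplified sketch, not the library's actual implementation; the flag and function names here are illustrative):

```lua
-- Simplified sketch of the worker-level flush lock described above.
-- One Lua VM per nginx worker, so a plain local works as a worker lock.
local flushing = false

local function flush(producer)
    if flushing then
        -- another coroutine holds the lock: give up and quit
        return
    end
    flushing = true

    -- pull messages from the ring buffer into the sending buffer,
    -- then do the Kafka request; if this call blocks, nothing else
    -- drains the ring buffer and producers keep filling it up
    local messages = drain_ringbuffer(producer)
    send_to_kafka(producer, messages)

    flushing = false
end
```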
The solution to this problem:
- let one and only one coroutine consume all the messages in the ring buffer
- make sure that coroutine consumes messages from the ring buffer faster than they are produced, which means adjusting these params (in the producer config):
  - batch_num (e.g. >= the max QPS of message production)
  - flush_time (e.g. 1000 ms)
  - max_buffering (leave enough room for a blocked Kafka request to finish or time out, which means it must be >= batch_num * (socket_timeout/1000 or 3))
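Applied to lua-resty-kafka, the tuning above looks roughly like this. The option names (producer_type, batch_num, flush_time, socket_timeout, max_buffering) are real producer config keys; the concrete values are examples sized for a workload around 6K QPS, not universal recommendations:

```lua
local producer = require "resty.kafka.producer"

local broker_list = {
    { host = "127.0.0.1", port = 9092 },
}

local p = producer:new(broker_list, {
    producer_type  = "async",
    batch_num      = 8000,    -- >= max message QPS (here ~6K)
    flush_time     = 1000,    -- flush every 1000 ms
    socket_timeout = 3000,    -- 3 s per Kafka request
    -- leave room for a blocked request to finish or time out:
    -- >= batch_num * (socket_timeout / 1000)
    max_buffering  = 50000,
})
```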
[memo] The worker-level lock cannot be removed: removing it causes the same message to be read by multiple coroutines in timers under high QPS (i.e. messages get duplicated). This is weird, since only one coroutine can hold the CPU at a time and there should be no race condition.