Buffer overflow occurs frequently in async mode
We use lua-resty-kafka to send our logs to Kafka. QPS is 6K+, and the size per request is 0.6 KB. However, we see many buffer overflow errors in the error log. And I found the source of this error in ringbuffer.lua:
function _M.add(self, topic, key, message)
    local num = self.num
    local size = self.size

    if num >= size then
        return nil, "buffer overflow"
    end

    -- each message occupies 3 consecutive slots: topic, key, message
    local index = (self.start + num) % size
    local queue = self.queue

    queue[index] = topic
    queue[index + 1] = key
    queue[index + 2] = message

    self.num = num + 3

    return true
end
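For context on the numbers below: each message takes three slots in the queue (topic, key, message), so the internal `self.size` is `max_buffering * 3`, and the buffer holds at most `max_buffering` messages. A minimal sketch of how the overflow is hit, using the names from the snippet above (the constructor arguments shown are an assumption, not necessarily the module's exact signature):

```lua
-- Sketch only: assumes the internal resty.kafka.ringbuffer module
-- can be constructed directly; in practice the producer creates it.
local ringbuffer = require "resty.kafka.ringbuffer"

-- suppose max_buffering = 2 messages -> self.size = 2 * 3 = 6 slots
local buffer = ringbuffer:new(200, 2)

buffer:add("logs", nil, "message 1")            -- ok
buffer:add("logs", nil, "message 2")            -- ok
local ok, err = buffer:add("logs", nil, "m3")   -- nothing has drained the
                                                -- buffer yet, so this fails
-- ok == nil, err == "buffer overflow"
```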
What config should I set? And what does this error mean?
I have the same problem! `opt.max_buffering` is set to the default value (50000), but the library prints 'buffer overflow' when QPS is greater than 50. I debugged the function _M.add: `self.size` is 150000 (i.e. 50000 * 3), which proves the configuration is applied correctly. @doujiang24
@IvyTang @logbird This usually means the network between the producer and the Kafka server is not fast enough. Are the producer and the Kafka server in the same datacenter?
(Translated from Chinese:) Let's switch to Chinese between us. My OpenResty and Kafka services are deployed in the same datacenter, and our current production traffic should not trigger this problem.

While reproducing the issue in a development environment, I found a way to trigger it:

1. Use ab to load-test the OpenResty producer with concurrency 50.
2. Shut down the Kafka service to simulate a Kafka failure.

At this point Kafka cannot accept messages, but the OpenResty buffer count does not grow; error messages are just printed to the error log and the messages are lost. However, if nginx is reloaded at this moment, after the reload the buffer count only grows and never shrinks, until the overflow is triggered.

This reproduction may differ from whatever is triggering the problem in production, because our production Kafka service has been healthy the whole time. So please help take a look. Also, if convenient, I'd like to discuss over QQ: 1027672948 @doujiang24
Any solution to this issue?
I've run into this problem and solved it. Sharing it for others.

The root cause:
- there is an ngx-worker-level lock around flushing the messages: the flushing coroutine takes the lock, pulls data from the ring buffer, fills it into the sending buffer, sends the request to Kafka, then releases the lock
- only one coroutine can hold the lock at a time
- other coroutines cannot get the lock while a Kafka request is still in flight, and will simply quit
- if the coroutine holding the lock blocks (because of Kafka), the ring buffer overflows
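The flush path described above can be sketched roughly like this (a simplified sketch, not the library's actual implementation; the flag and function names here are illustrative):

```lua
-- Simplified sketch of the worker-level flush lock described above.
-- One Lua VM per nginx worker, so a plain local works as a worker lock.
local flushing = false

local function flush(producer)
    if flushing then
        -- another coroutine holds the lock: give up and quit
        return
    end
    flushing = true

    -- pull messages from the ring buffer into the sending buffer,
    -- then do the Kafka request; if this call blocks, nothing else
    -- drains the ring buffer and producers keep filling it up
    local messages = drain_ringbuffer(producer)
    send_to_kafka(producer, messages)

    flushing = false
end
```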
The solution to this problem:
- let one and only one coroutine consume all the messages in the ring buffer
- make sure that coroutine consumes messages from the ring buffer faster than they are produced, which means adjusting these params (in the producer config):
  - batch_num (e.g. >= the max QPS of message production)
  - flush_time (e.g. 1000 ms)
  - max_buffering (leave enough room for a blocked Kafka request to finish or time out, which means it must be >= batch_num * (socket_timeout/1000 or 3))
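Applied to lua-resty-kafka, the tuning above looks roughly like this. The option names (producer_type, batch_num, flush_time, socket_timeout, max_buffering) are real producer config keys; the concrete values are examples sized for a workload around 6K QPS, not universal recommendations:

```lua
local producer = require "resty.kafka.producer"

local broker_list = {
    { host = "127.0.0.1", port = 9092 },
}

local p = producer:new(broker_list, {
    producer_type  = "async",
    batch_num      = 8000,    -- >= max message QPS (here ~6K)
    flush_time     = 1000,    -- flush every 1000 ms
    socket_timeout = 3000,    -- 3 s per Kafka request
    -- leave room for a blocked request to finish or time out:
    -- >= batch_num * (socket_timeout / 1000)
    max_buffering  = 50000,
})
```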
[memo] The worker-level lock cannot be removed: removing it causes the same message to be read by multiple coroutines in timers under high QPS (i.e. messages get duplicated). This is weird, since only one coroutine can hold the CPU at a time and there should be no race condition.