parallel-consumer
Max loading factor steps reached: 100/100
Hi,
today we noticed an issue with our service. It stopped processing data, and when we looked into the logs we saw a lot of messages stating: "isPoolQueueLow(): Max loading steps reached 100/100"
After restarting the service it ran properly again, but I still wanted to know whether the logs were any indicator of the error, or if it was something else. When does this WARN message occur?

Can you provide more logs / did you capture them? earlier entries? (you can email them to me)
Any idea when this WARN message could occur?
Hi, yes it's supposed to be when the processing is going faster than records can be queued up for processing - which is usually an indication of something going wrong. This issue is in the queue, will look more into it soon. However, it's a bit of a doozy. I don't suppose you noticed anything weird with the record processing when this happened? Was everything else appearing to be fine?
We need to add the metrics end point / info so when this happens we can get more direct info from the internal state of PC.
How often has this happened? Are you able to reproduce it? Was throughput going slow when this happened?
It's a relatively arbitrary max for the loading factor - but usually 3-5x is enough. 100 should be way too much. It prevents potential OOM errors with the queue sizes.
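As a rough illustration of what the loading factor does (a hypothetical sketch only, not PC's actual internals; the names and starting values are assumptions): the factor scales how many records are kept queued ahead of the worker pool, it gets stepped up whenever the queue runs low, and the warning fires once the cap of 100 steps is hit.

```java
// Hypothetical sketch of a stepped loading factor - for illustration only,
// not the actual parallel-consumer implementation.
class LoadingFactorSketch {
    static final int MAX_STEPS = 100; // the 100/100 cap from the warning
    private final int maxConcurrency;
    private int loadingFactor = 2;    // usually 3-5x is already enough

    LoadingFactorSketch(int maxConcurrency) {
        this.maxConcurrency = maxConcurrency;
    }

    // Queue is bounded to avoid OOM: roughly loadingFactor * maxConcurrency records.
    int targetQueueSize() {
        return loadingFactor * maxConcurrency;
    }

    // Called when the workers drain the queue faster than it is being refilled.
    void onPoolQueueLow() {
        if (loadingFactor < MAX_STEPS) {
            loadingFactor++; // queue up more records ahead of the workers
        } else {
            System.out.println("WARN isPoolQueueLow(): Max loading factor steps reached: "
                    + loadingFactor + "/" + MAX_STEPS);
        }
    }
}
```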
Hi, unfortunately we didn't notice anything strange before it happened. The last time was yesterday, and I tried to find some kind of indicator of why this is happening, but from the logs everything seems to work fine, until out of nowhere those 100/100 log messages appear.
How this works for us is that we receive a batch of data throughout the night, our services process it, and in the morning we check whether everything worked as expected. When we noticed that no data was processed that night, it was usually because of that error. But this is something we can't really reproduce, since after restarting the service it suddenly works again.
"when the processing is going faster than records can be queued up for processing" This is an interesting information, since how our service works is, it checks via a flag, if we even have to send the data to an endpoint or not. If that is not the case the processing is then probably way faster compared to calling the endpoint and waiting for a response. If a batch of that type arrives the processing could be way faster then the queueing up of new message...
But why does the processing completely stop when it reaches max?
How often has this happened? On average once a week, I think
Was throughput going slow when this happened? Didn't have an eye on that yet
Thx for the response :)
So when this happens - no messages are getting processed? That's not what I expected 🤔
flag
ah - if this flag is set, your function returns immediately, without really doing much? Yes, that could cause it. However, it shouldn't prevent any progress being made - it's just the system telling you that it's not going to increase the loading factor beyond 100 (which means the number of queued messages is about 100 * the maxConcurrency setting)...
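For reference, maxConcurrency is the option set on the processor - something like this README-style setup (a sketch; exact builder options can vary by version):

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.List;
import java.util.Properties;

public class PcSetupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-service");
        props.put("enable.auto.commit", "false"); // PC manages offsets itself
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);

        var options = ParallelConsumerOptions.<String, String>builder()
                .ordering(ProcessingOrder.KEY)
                .maxConcurrency(16) // the bound that the loading factor multiplies
                .consumer(kafkaConsumer)
                .build();

        ParallelStreamProcessor<String, String> pc =
                ParallelStreamProcessor.createEosStreamProcessor(options);
        pc.subscribe(List.of("my-topic"));
        pc.poll(context -> {
            // user function runs here concurrently; at the max loading factor the
            // warning refers to, roughly 100 * maxConcurrency records can be queued
            System.out.println("processing " + context);
        });
    }
}
```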
But why does the processing completely stop when it reaches max?
yeah that shouldn't be the case.. what indication do you see that no progress is being made?
So we are using KafkaHQ to access our Kafka topics, and there we see that our service's consumer group has a lag which doesn't decrease. We also no longer see requests arriving at the target endpoint. Maybe the reason processing stops is something different, and the WARN log message is just a consequence of that, because the queue doesn't get filled...
Which version are you running?
Are you not getting any failed processing logs?
Can you show your processing code?
We were running version 0.5.2.0, just upgraded to 0.5.2.3. I wasn't able to capture any failed processing logs before. Maybe it would make sense to organize a call session at some point, where we could discuss this and go through the code?
yes, sure we can do that - please email me directly, or contact me on community slack.
Ok thanks, I will ping you, but currently I am busy with other topics ;)
FYI, I'm going to add a limiter so that the logs aren't spammed with this. And - the whole system is going to be removed in upcoming versions, so won't be an issue then.
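(For the curious, the kind of limiter meant is just a time-based guard around the log call - a minimal sketch, not the actual change:)

```java
import java.time.Duration;
import java.time.Instant;

// Minimal sketch of time-based log rate limiting - illustration only,
// not the actual parallel-consumer change.
class LogRateLimiter {
    private final Duration minInterval;
    private Instant lastLogged = Instant.EPOCH;

    LogRateLimiter(Duration minInterval) {
        this.minInterval = minInterval;
    }

    // Runs the log action at most once per minInterval; extra calls are dropped.
    synchronized void maybeLog(Runnable logAction) {
        Instant now = Instant.now();
        if (Duration.between(lastLogged, now).compareTo(minInterval) >= 0) {
            lastLogged = now;
            logAction.run();
        }
    }
}
```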
Hi, @astubbs. We are experiencing this issue now with version 0.5.2.3. When it happened, no messages were consumed from the assigned partition, like what @bartman64 encountered previously. Furthermore, it's relatively easy to reproduce. Restarting the problematic services didn't solve the issue, because it came back soon after. Do you have any suggestions to mitigate this? Thanks!
Thanks @colinkuo for the report! Just to check - was PC still making progress? And you just see it spamming the logs with this error? .5 (releasing next week) has added a rate limiter to the warning. It’s not something to be too concerned about - it should only be performance related.
Can you tell me what your processing function does, or was doing at the time? Is it possible your function was completing records immediately with no delay?
I’m very much looking forward to merging the changes in 0.6 which will remove this system :)
Hi, @astubbs When the error was spamming the logs, the CPU usage of the pods was low, and the offsets of the assigned partitions were not moving fast enough. The outcome we've seen is that lag built up only on the partitions assigned to the problematic Kubernetes pods. I cannot tell whether it was stuck or just processing slowly.
As for our user function, it takes about 5 ms on average, so I'd say there is at least some delay in processing each message in the user function.
Would you please elaborate on how removing the system can address this issue? Thanks!
Hi @colinkuo, @bartman64 - do you still experience this issue with version 0.5.2.7, which has fixes for #547 and #606?
I am wondering if the issue here is the same as reported in #547...
#637 seems to be somewhat related - in that it gets the same warning logged - but it seems to be caused by committing during rebalance handling rather than by PC not dropping stale work from its queues correctly.
Hi, we haven't run into that particular issue anymore :)
Hi @rkolesnev We are starting to test 0.5.2.7 and will keep you posted on any updates. Thanks!