
Max loading factor steps reached: 100/100

Open bartman64 opened this issue 3 years ago • 19 comments

Hi,

today we noticed an issue with our service. It stopped processing data, and when we looked into the logs we saw a lot of messages stating: "isPoolQueueLow(): Max loading steps reached 100/100"

After restarting the service it ran properly again, but I still wanted to know whether the logs were any indicator of the error, or if it was something else. When does this WARN message occur?

[Screenshot: 2022-08-26_08h24_57]

bartman64 avatar Aug 26 '22 13:08 bartman64

Can you provide more logs / did you capture them? Earlier entries? (You can email them to me.)

astubbs avatar Sep 02 '22 09:09 astubbs

Attached log, maybe it helps.

com-abc-rest-caller-testenv.log

bartman64 avatar Sep 06 '22 15:09 bartman64

Any idea when this WARN message could occur?

bartman64 avatar Sep 21 '22 15:09 bartman64

Hi, yes it's supposed to be when the processing is going faster than records can be queued up for processing - which is usually an indication of something going wrong. This issue is in the queue, will look more into it soon. However, it's a bit of a doozy. I don't suppose you noticed anything weird with the record processing when this happened? Was everything else appearing to be fine?

We need to add the metrics endpoint / info so that when this happens we can get more direct info from the internal state of PC.

astubbs avatar Sep 23 '22 10:09 astubbs

How often has this happened? Are you able to reproduce it? Was throughput going slow when this happened?

The max for the loading factor is relatively arbitrary - usually 3-5x is enough, so 100 should be far more than needed. The cap prevents potential OOM errors from the queue sizes.
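
For illustration, a minimal sketch of the idea behind that cap (hypothetical names, not PC's actual internals):

```java
// Hypothetical sketch of an adaptive loading factor, not parallel-consumer's real code.
// Idea: if workers keep draining the work queue faster than records are loaded,
// buffer more records per worker, up to a hard ceiling that protects against OOM.
class LoadingFactorSketch {
    private static final int MAX_LOADING_FACTOR = 100; // ceiling that triggers the WARN
    private int loadingFactor = 2;                      // start small, e.g. 2-5x concurrency
    private final int maxConcurrency;

    LoadingFactorSketch(int maxConcurrency) {
        this.maxConcurrency = maxConcurrency;
    }

    /** Target number of records to keep buffered for the worker pool. */
    int targetQueueSize() {
        return loadingFactor * maxConcurrency;
    }

    /** Called when the pool's work queue is found to be running low. */
    void onPoolQueueLow() {
        if (loadingFactor >= MAX_LOADING_FACTOR) {
            // The situation behind "Max loading factor steps reached: 100/100"
            System.err.printf("WARN isPoolQueueLow(): Max loading factor steps reached %d/%d%n",
                    loadingFactor, MAX_LOADING_FACTOR);
            return;
        }
        loadingFactor++; // buffer more records per worker next time
    }
}
```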

astubbs avatar Sep 23 '22 10:09 astubbs

Hi, unfortunately we didn't notice anything strange before it happened. The last time it happened was yesterday, and I tried to find some kind of indicator of why this is happening, but from the logs everything seems to work fine until, out of nowhere, those 100/100 log messages appear.

How this works for us is that we receive a batch of data throughout the night that our services process, and in the morning we check whether everything worked as expected. When we noticed that no data was processed during the night, it was usually because of that error. But this is something we can't really reproduce, since after restarting the service it suddenly works again.

"when the processing is going faster than records can be queued up for processing" This is an interesting information, since how our service works is, it checks via a flag, if we even have to send the data to an endpoint or not. If that is not the case the processing is then probably way faster compared to calling the endpoint and waiting for a response. If a batch of that type arrives the processing could be way faster then the queueing up of new message...

But why does the processing completely stop when it reaches max?

"How often has this happened?" On average once a week, I think.

"Was throughput going slow when this happened?" I haven't kept an eye on that yet.

Thx for the response :)

bartman64 avatar Sep 23 '22 11:09 bartman64

So when this happens - no messages are getting processed? That's not what I expected 🤔

> flag

ah - if this flag is set, your function returns immediately, without really doing much? Yes, that could cause it. However, it shouldn't prevent any progress being made - it's just the system telling you that it's not going to increase the loading factor beyond 100 (which means the number of queued messages is about 100 * the maxConcurrency setting)...
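
As a rough sketch of that scenario, assuming a fairly standard PC setup (topic name, flag check, sendToEndpoint() and the maxConcurrency value are made up, and PollContext accessor names may differ between versions):

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.List;

// Rough sketch of the described setup. Topic name, flag check, sendToEndpoint() and the
// maxConcurrency value are invented for illustration; verify the accessor names against
// the PC version in use.
public class FlagBasedProcessor {

    public static void run(KafkaConsumer<String, String> consumer) {
        var options = ParallelConsumerOptions.<String, String>builder()
                .consumer(consumer)
                .ordering(ProcessingOrder.KEY)
                .maxConcurrency(16) // buffered records scale with loadingFactor * maxConcurrency
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);
        processor.subscribe(List.of("input-topic"));

        processor.poll(context -> {
            String value = context.getSingleRecord().value();
            if (!shouldSend(value)) {
                // Fast path: the record completes almost instantly, so the worker pool drains
                // its queue faster than the poller refills it and the loading factor climbs.
                return;
            }
            sendToEndpoint(value); // slow path: blocking call to the target endpoint
        });
    }

    private static boolean shouldSend(String value) {
        return value.contains("\"send\":true"); // placeholder flag check
    }

    private static void sendToEndpoint(String value) {
        // placeholder for the real HTTP call
    }
}
```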

> But why does the processing completely stop when it reaches max?

Yeah, that shouldn't be the case... What indication do you see that no progress is being made?

astubbs avatar Sep 23 '22 12:09 astubbs

So we are using KafkaHQ to access our Kafka topics, and there we see that our service's consumer group has lag which doesn't decrease. We also no longer see requests arriving at the target endpoint. Maybe the reason processing stops is something different, and the WARN log message is just a consequence of that, because the queue doesn't get filled...
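
For reference, the same lag check can also be done programmatically with Kafka's AdminClient instead of KafkaHQ (the broker address and group id below are placeholders):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

// Compares the group's committed offsets with the log-end offsets of the same partitions.
// Lag that stays flat across repeated runs means no progress is being made.
public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-service-group") // placeholder group id
                         .partitionsToOffsetAndMetadata().get();

            Map<TopicPartition, OffsetSpec> latest = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(latest).all().get();

            committed.forEach((tp, offset) -> {
                if (offset == null) return; // no committed offset for this partition
                long lag = endOffsets.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```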

bartman64 avatar Sep 30 '22 06:09 bartman64

Which version are you running?

Are you not getting any failed processing logs?

Can you show your processing code?

astubbs avatar Oct 10 '22 11:10 astubbs

We were running version 0.5.2.0 and just upgraded to 0.5.2.3. I wasn't able to capture any failed processing logs before. Maybe it would make sense to organize a call at some point, where we could discuss this and go through the code?

bartman64 avatar Oct 19 '22 04:10 bartman64

Yes, sure, we can do that - please email me directly, or contact me on the community Slack.

astubbs avatar Oct 20 '22 18:10 astubbs

Ok thanks, I will ping you, but currently I am busy with other topics ;)

bartman64 avatar Nov 04 '22 15:11 bartman64

FYI, I'm going to add a limiter so that the logs aren't spammed with this. And - the whole system is going to be removed in upcoming versions, so it won't be an issue then.
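
For context, a limiter along these lines (illustration only, not necessarily how it will be implemented in PC):

```java
import java.time.Duration;
import java.time.Instant;

// Emit the warning at most once per time window instead of on every queue check.
class RateLimitedWarning {
    private final Duration window;
    private Instant lastLogged = Instant.EPOCH;

    RateLimitedWarning(Duration window) {
        this.window = window;
    }

    void warn(String message) {
        Instant now = Instant.now();
        if (Duration.between(lastLogged, now).compareTo(window) >= 0) {
            lastLogged = now;
            System.err.println("WARN " + message);
        }
    }
}
```

For example, new RateLimitedWarning(Duration.ofSeconds(30)) would log the message at most twice a minute, however often the queue check fires.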

astubbs avatar Dec 09 '22 08:12 astubbs

Hi, @astubbs. We are experiencing this issue now with version 0.5.2.3. When it happened, no messages were consumed from the assigned partition, just as @bartman64 encountered previously. Furthermore, it's relatively easy to reproduce. Restarting the problematic services didn't solve it, because the issue came back soon after. Do you have any suggestions to mitigate this? Thanks!

colinkuo avatar Jan 20 '23 23:01 colinkuo

Thanks @colinkuo for the report! Just to check - was PC still making progress? And you just see it spamming the logs with this error? .5 (releasing next week) has added a rate limiter to the warning. It’s not something to be too concerned about - it should only be performance related.

Can you tell me what your processing function does, or was doing at the time? Is it possible your function was completing records immediately with no delay?

I’m very much looking forward to merging the changes in 0.6 which will remove this system :)

astubbs avatar Jan 21 '23 10:01 astubbs

Hi, @astubbs When the error was spamming the logs, the CPU usage of the pods was low, and the offsets of the assigned partitions were not moving fast enough. The outcome we saw was that lag built up only on the partitions assigned to the problematic Kubernetes pods. I cannot tell whether it was stuck or just processing slowly.

Our user function takes about five ms on average, so I'd say there is some delay in processing each message in the user function.

Would you please elaborate on how removing the system can address this issue? Thanks!

colinkuo avatar Jan 24 '23 00:01 colinkuo

Hi @colinkuo, @bartman64 - do you still experience this issue with version 0.5.2.7, which has fixes for #547 and #606?

I am wondering if the issue here is the same as reported in #547...

Issue #637 seems somewhat related - it gets the same warning logged - but it appears to be caused by a commit during rebalance handling rather than by PC not dropping stale work from its queues correctly.

rkolesnev avatar Sep 12 '23 08:09 rkolesnev

Hi, we haven't run into that particular issue anymore :)

bartman64 avatar Sep 12 '23 13:09 bartman64

Hi @rkolesnev, we are starting to test 0.5.2.7 and will keep you posted on any updates. Thanks!

colinkuo avatar Sep 15 '23 05:09 colinkuo