client_java

Synchronisation issue in Buffer causing metrics scraping failure

Open markusaaltosc opened this issue 1 year ago • 2 comments
Fixed a synchronisation issue in Buffer where changes made by the doAppend() method were not always visible to the run() method. With the bug, the implementation got stuck in a busy loop, which stopped metrics scraping. The issue surfaced when running on an AWS Graviton ARM processor with a 5s scraping interval and a large number of simultaneous threads constantly adding new observations. Once that was fixed and the doAppend() changes became visible, it was also possible for bufferPos to be larger than expectedBufferSize (if doAppend() was called after expectedBufferSize was calculated and before appendLock was acquired in the run() method). That is now also fixed by ensuring we stay in the loop only while bufferPos < expectedBufferSize. @fstab, please check this.
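To illustrate the two points above, here is a minimal sketch. It is not the actual client_java Buffer code: the class `BufferSketch` and the method `awaitBuffered` are hypothetical names, and the real run() method does more than this. It only assumes what the description states: a `bufferPos` counter, an `appendLock`, and an `expectedBufferSize` that is calculated before the lock is taken. The reader takes the same lock as the writer, so appends are visible to it, and the wait ends as soon as `bufferPos` reaches or exceeds the expected size.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified stand-in for the Buffer discussed in this issue.
class BufferSketch {
    private final ReentrantLock appendLock = new ReentrantLock();
    private double[] buffer = new double[16];
    private int bufferPos = 0; // only read or written while holding appendLock

    // Writer side: append one observation under the lock.
    void doAppend(double value) {
        appendLock.lock();
        try {
            if (bufferPos >= buffer.length) {
                buffer = Arrays.copyOf(buffer, buffer.length * 2);
            }
            buffer[bufferPos++] = value;
        } finally {
            appendLock.unlock();
        }
    }

    // Reader side: wait until at least expectedBufferSize observations are buffered.
    // Reading bufferPos under appendLock makes the writers' updates visible here.
    // The loop continues only while bufferPos < expectedBufferSize, so a late
    // doAppend() that pushes bufferPos past the expected size cannot trap it.
    int awaitBuffered(int expectedBufferSize) {
        while (true) {
            appendLock.lock();
            try {
                if (bufferPos >= expectedBufferSize) {
                    return bufferPos;
                }
            } finally {
                appendLock.unlock();
            }
            Thread.yield(); // let writers make progress before checking again
        }
    }
}
```

With an equality check instead (staying in the loop while `bufferPos != expectedBufferSize`), the overshoot case described above would spin forever.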

markusaaltosc avatar Jul 03 '24 10:07 markusaaltosc

Thanks a lot @markusaaltosc, I'll have a look.

We did extensive long-running load tests on Intel / AMD64 CPUs, but never on Graviton ARM.

I'll get back to you as soon as I find the time to review.

fstab avatar Jul 03 '24 15:07 fstab

We have now been running with the fix for a few weeks on both Intel and Graviton, and it looks stable so far.

markusaaltosc avatar Jul 04 '24 08:07 markusaaltosc

@markusaaltosc I've created an alternative fix - https://github.com/prometheus/client_java/pull/991 - please have a look

zeitlinger avatar Oct 01 '24 12:10 zeitlinger