telegraf icon indicating copy to clipboard operation
telegraf copied to clipboard

Add a new metric 'StolenTargetMemoryRatio' to ‎sqlPerformanceMetrics in telegraf/plugins/inputs/sqlserver/sqlqueriesV1.go

Open chenteddybear opened this issue 2 years ago • 6 comments

Use Case

This metric will help identify potential memory leak introduced in sql 2019 CU14 where MEMORYCLERK_SOSNODE takes up exessive amount of available memory without releasing them, resulting in crashing the sql server. MEMORYCLERK_SOSNODE is categorized under 'Stolen Memory' in SQL Server.

UNION ALL SELECT 'Stolen target memory ratio', StolenTargetMemoryRatio = 100.0 * cntr_value / (SELECT 1.0*cntr_value FROM sys.dm_os_performance_counters WHERE counter_name = 'Target Server Memory (KB)') FROM sys.dm_os_performance_counters WHERE counter_name = 'Stolen Server Memory (KB)

Expected behavior

When the value of Stolen/Target memory goes above 50% or even 60% it can be considered alarming for DBA.

Actual behavior

it returns the ratio of stolen/target memory in sql server.

Additional info

https://learn.microsoft.com/en-us/answers/questions/826973/what-can-cause-memoryclerk-sosnode-clerk-gradually

chenteddybear avatar Feb 12 '24 05:02 chenteddybear

@Trovalo does that make sense to you?

@chenteddybear a pull-request to add this is welcome... ;-)

srebhan avatar Feb 12 '24 13:02 srebhan

it costs nothing to add this counter, it's just an additional string in the WHERE filter of the perfCounter query.

About how useful it might be... I usually use the "Page Life Expectancy" to check for memory pressure, but everyone has his ways

Trovalo avatar Feb 13 '24 11:02 Trovalo

@chenteddybear,

Is using the Page Life Expectancy something you have considered?

Otherwise, are you going to put up a PR?

Thanks!

powersj avatar Feb 13 '24 13:02 powersj

Hi @powersj Page Life Expectancy is not enough to indicate memory pressure / memory leak. The new metric may be useful in advising DBAs to investigate SQL memory clerks for odd behavior while troubleshooting.

As for a PR, I'm not sure exactly how to proceed, as it'll be my first time. Thank you.

chenteddybear avatar Feb 13 '24 14:02 chenteddybear

@chenteddybear in 20-30mins there will be artifacts that you can download attached to the following PR: https://github.com/influxdata/telegraf/pull/14792

Could you please download and confirm that the new metric is visible? Thanks

powersj avatar Feb 13 '24 16:02 powersj

That would be great to get it added as that a very useful stat to collect.

I'd mentioned it back here #9112 that it was missing from v2 stats onwards as well as Database Cache Memory and Free Memory, but never chased it up as just had to do it via an exec command.

It would be great to get these other two stats added in too.

simonwhybrow avatar Feb 16 '24 09:02 simonwhybrow

@chenteddybear and @simonwhybrow can you please test the binary in PR #14792 and verify it produces what you expect!?

srebhan avatar Feb 19 '24 09:02 srebhan

@chenteddybear and @simonwhybrow can you please test the binary in PR #14792 and verify it produces what you expect!?

Yep, looks good. I can see it pulling through Stolen memory. Any chance we can get the other two metrics added as mentioned in #9112 - Database Cache Memory (KB) and Free Memory (KB)? Or does that need to be a separate PR?

simonwhybrow avatar Feb 19 '24 10:02 simonwhybrow

@powersj do you think we should add the two metric mentioned by @simonwhybrow?

srebhan avatar Feb 19 '24 11:02 srebhan

I've updated the PR with the new additional metrics.

@simonwhybrow @chenteddybear,

Can you verify the updates look correct. Thanks

powersj avatar Feb 20 '24 17:02 powersj

@powersj looks good. Thank you for adding those counters.

simonwhybrow avatar Feb 20 '24 23:02 simonwhybrow

looks great! thank you for your help!

chenteddybear avatar Feb 21 '24 00:02 chenteddybear