svm_nfs_ops is reporting Billion IOPs for an SVM with 10 nodes.

Open jmg011 opened this issue 2 years ago • 6 comments

The svm_nfs_ops metric is reporting a billion IOPs for an SVM with 10 nodes. Sometimes it also shows negative values of around half a billion IOPs.

Running the latest major release of Harvest:

bin/harvest version
harvest version 22.05.0-1 (commit 2bc2942) (build date 2022-05-11T07:57:16-0400) linux/amd64

One-day time series of the svm_nfs_ops metric for a single SVM with 10 nodes. The spikes reach a billion IOPs.

image

Can you help check whether this is a Harvest bug? For the same period, OCUM shows 1 million IOPs for the SVM while Prometheus shows 1 billion IOPs.

jmg011 avatar Aug 16 '22 21:08 jmg011

@jmg011 There is a similar issue, #762, about negative counters related to SVM NFS v3. Could you confirm whether the numbers reported in System Manager are in the millions or the billions?

We have handled negative counters in #1205 by changing them to 0; this will be in our upcoming release.
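To illustrate what changing negative counters to 0 means for a rate metric such as svm_nfs_ops, here is a minimal sketch. It is not Harvest's code: the function name, the sample layout, and the numbers are all hypothetical, and it assumes the rate is derived as the delta between two cumulative counter samples divided by the elapsed time.

```python
from typing import List, Tuple


def rates_clamped(samples: List[Tuple[float, float]]) -> List[float]:
    """Per-second rates from (timestamp_seconds, cumulative_count) samples.

    Hypothetical helper: a negative delta (the counter went backwards)
    is reported as 0 instead of a negative rate.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # ignore duplicate or out-of-order timestamps
        delta = c1 - c0
        rates.append(max(delta, 0) / elapsed)
    return rates


# Made-up samples: the counter drops at t=120 and recovers at t=180.
samples = [(0, 1000), (60, 1600), (120, 100), (180, 2200)]
print(rates_clamped(samples))  # [10.0, 0.0, 35.0]
```

In this made-up data, clamping removes the negative value, but the interval after the drop still reports an inflated rate because it is measured against the bad sample.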

rahulguptajss avatar Aug 17 '22 07:08 rahulguptajss

hi @jmg011 can you also share the ONTAP version and whether these are NFS v3, v4, or v4.1 shares?

cgrinds avatar Aug 17 '22 12:08 cgrinds

@rahulguptajss When you say System Manager, do you mean OCUM? OCUM reports values in the millions. Prometheus scraped values in the billions from the Harvest exporter.

image

@cgrinds Version: NetApp Release 9.8P9 & NFS v3

jmg011 avatar Aug 17 '22 19:08 jmg011

Thanks for the ONTAP and NFS version information. Our suspicion is this is an ONTAP counter bug since we have a few customer reports of negative counters with NFS. The Harvest logic for handling NFS counters is the same as the other performance counters.

System Manager is the UI that can be used to manage a cluster.

E.g. image

cgrinds avatar Aug 18 '22 11:08 cgrinds

hi @jmg011

  • Any changes to your conf/zapiperf/cdot/9.8.0/nfsv3.yaml template that captures this metric?
  • Would it be possible to monitor this cluster from a separate poller capturing trace logs? Something like this:
bin/poller --promPort 19002 --poller $poller-name --collectors ZapiPerf --objects NFSv3 --loglevel 0 2>&1 | tee nfs.txt

Let that run for 30 minutes or so and then email the nfs.txt file to [email protected]

cgrinds avatar Aug 18 '22 13:08 cgrinds

@cgrinds No changes to the template. I will start the poller in my dev environment to capture logs and will send them to [email protected] today.

jmg011 avatar Aug 22 '22 14:08 jmg011

Thanks again for the log files @jmg011; they were very helpful. We're working on some improvements in this area and will ping you when they have made it through CI and integration tests.

cgrinds avatar Sep 08 '22 12:09 cgrinds

This issue is now fixed in the main branch. The solution is to skip any negative counters, and any spikes generated by this kind of data.
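A rough sketch of the skipping idea, under stated assumptions: this is not the code merged into Harvest, and the function name and sample data are hypothetical. It assumes that an interval with a negative delta is dropped, and that the following interval is dropped too, because it is measured against the suspect sample and would otherwise appear as a spike.

```python
from typing import List, Optional, Tuple


def rates_skipping_bad_samples(
    samples: List[Tuple[float, float]],
) -> List[Optional[float]]:
    """Per-second rates from (timestamp_seconds, cumulative_count) samples.

    Hypothetical helper: None marks an interval that is skipped rather
    than exported.
    """
    rates: List[Optional[float]] = []
    skip_next = False
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        delta = c1 - c0
        if elapsed <= 0 or delta < 0:
            # Counter went backwards (or time did): skip this interval
            # and distrust the next one as well.
            rates.append(None)
            skip_next = True
            continue
        if skip_next:
            # Rebound after a bad sample: skip the would-be spike.
            rates.append(None)
            skip_next = False
            continue
        rates.append(delta / elapsed)
    return rates


# Made-up samples: the counter drops at t=120 and recovers at t=180.
samples = [(0, 1000), (60, 1600), (120, 100), (180, 2200), (240, 2800)]
print(rates_skipping_bad_samples(samples))  # [10.0, None, None, 10.0]
```

Skipping leaves a gap in the time series instead of a zero or a spike, so dashboards that sum or average the metric are not skewed by the bad sample.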

rahulguptajss avatar Sep 12 '22 07:09 rahulguptajss

hi @jmg011 when you get a chance, could you grab the nightly build and see if our latest fix addresses your billions problem? Thanks!

cgrinds avatar Sep 13 '22 14:09 cgrinds

Verified the negative counter logic in 22.11.

rahulguptajss avatar Nov 15 '22 09:11 rahulguptajss