svm_vol_total_ops does not match SVM IOPS in SM
A note for the community
No response
Problem
We see a discrepancy between svm_vol_total_ops and the IOPS shown in System Manager, which we can't explain.
SM:
NABox Dashboard:
Configuration
No response
Poller
Version
23.11.0 and 24.02.0
Poller logs
No response
OS and platform
NABox
ONTAP or StorageGRID version
ONTAP 9.11.1P8
Additional Context
No response
References
No response
SM:
NABox:
Hi @BrendonA667, the two graphs you added above are not directly comparable. SM shows data at the protocol level (protocols/nfs/services/{svm_uuid}), where you can choose the protocol from a dropdown, whereas Harvest shows data at the SVM level (svm_vol_write_ops).
SM screenshot where NFSv3 and iSCSI options exist for the chosen SVM.
If you would like to compare exact values with SM, compare the iSCSI panels and NFS panels for that SVM in the SVM dashboard; the values should be very close.
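For context, a minimal PromQL sketch of what the two views resolve to in Harvest, assuming the metric names used later in this thread and a placeholder SVM_NAME (the actual panel expressions in the SVM dashboard may differ):

# SVM-level view (volume counters), what the NABox dashboard graphs
svm_vol_total_ops{svm="SVM_NAME"}

# Protocol-level view, closer to what SM's per-protocol dropdown shows
svm_nfs_ops{svm="SVM_NAME"}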
Hi @Hardikl, thank you for your feedback. Shouldn't svm_vol_total_ops be even higher than the value shown in SM? I ask because we use QoS at the SVM level and check the SVM dashboard to see whether we hit the limit. In this specific case we saw a difference of almost 5000 IOPS.
Thanks @BrendonA667 for the response.
Could you please provide a few more details so we can better understand this?
- As you shared the SM screenshot earlier, could you share a larger screenshot that includes the protocol name, so we can compare SM values at each protocol level (NFSv3, NFSv4 if applicable, and iSCSI)?
- Also, can you share Harvest screenshots for the same protocols, meaning the iSCSI, NFSv3, and NFSv4 (if applicable) panels, over the same time range for comparison?
Yes, we agree that the svm_vol_total_ops value should be almost equal to or greater than the sum of all protocol-level values. We are also evaluating this, since the source of svm_vol_total_ops is the number of operations per second serviced at the volume level.
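To make that check concrete, one hedged sketch is to graph the SVM-level metric minus the summed protocol-level metrics for the same SVM (metric names as used in this thread, SVM_NAME is a placeholder; add terms for iSCSI or NFSv4 if those protocols carry load). A persistently negative result would contradict the "equal or greater" expectation:

  sum(svm_vol_total_ops{svm="SVM_NAME"}) by (datacenter, cluster, svm)
-
  (
      sum(svm_nfs_ops{svm="SVM_NAME"}) by (datacenter, cluster, svm)
    + sum(svm_cifs_op_count{svm="SVM_NAME"}) by (datacenter, cluster, svm)
  )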
Hi @Hardikl
CIFS SM
CIFS Grafana
NFSv3 SM
NFSv3 Grafana
@BrendonA667 Thank you for providing the details. It appears there is a bug in the Harvest dashboard where not all I/O operations are represented in these panels. As mentioned in the KB article, the total ops for NFS/CIFS are not limited to read, write, and other ops; we also need to account for additional categories (access, getattr, lookup, setattr), which will help us reconcile these values. Harvest does collect this data. Could you try the following queries and compare the results?
Please replace SVM_NAME with the appropriate value.
NFS Protocol Operations Query:
(svm_nfs_read_ops + svm_nfs_write_ops + svm_nfs_access_total + svm_nfs_getattr_total + svm_nfs_lookup_total + svm_nfs_setattr_total{svm="SVM_NAME"})
CIFS Protocol Operations Query:
sum(svm_cifs_op_count{svm="SVM_NAME"}) by (datacenter, cluster, svm)
If these queries yield the expected results, that would explain the discrepancy you are observing with the CIFS protocol above.
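One note on the NFS query above: the svm filter is written only on the last selector. Because PromQL binary operators match series on their label sets, the whole expression should still narrow to that SVM (assuming these metrics share identical labels), but it may be clearer to apply the filter to every selector. A sketch of that variant:

  svm_nfs_read_ops{svm="SVM_NAME"}
+ svm_nfs_write_ops{svm="SVM_NAME"}
+ svm_nfs_access_total{svm="SVM_NAME"}
+ svm_nfs_getattr_total{svm="SVM_NAME"}
+ svm_nfs_lookup_total{svm="SVM_NAME"}
+ svm_nfs_setattr_total{svm="SVM_NAME"}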
As we discussed earlier, SM displays protocol-level data, whereas Harvest presents data both at the protocol level and at a higher level via the metric svm_vol_total_ops. We still need to check whether similar gaps exist for the svm_vol_total_ops metric.
Please confirm whether the protocol-level queries resolve the issue, and we can then proceed with developing a fix.
@BrendonA667 Related to svm_vol_total_ops: if we take an example where only one protocol load is present, say NFSv3, I have encountered an instance during my local testing where SM shows a spike that is approximately twice what Harvest reports. As a next step to validate this, I ran the ONTAP CLI, and its data matches what Harvest reports as svm_vol_total_ops. Could you check the same on your end to see whether that is the case?
Also, svm_vol_total_ops includes all the different categories (Read + Write + Other + Access + Getattr + Lookup + Punch Hole + Setattr), as we discussed earlier.
Below are the commands and data collected over the last 10 minutes, where the polling interval for both Harvest and ONTAP is 1 minute.
Note: the ONTAP clock is 5:30 hours behind.
ONTAP
statistics vserver show -interval 60 -iterations 50 -vserver astra_ci_vc_esxi_24_75_data
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ -------- -------
astra_ci_vc_esxi_24_75_data 506 9 458 35 155170 13170857 213
A250-41-42-43 : 4/18/2024 14:53:03
astra_ci_vc_esxi_24_75_data 590 10 534 43 167850 14494488 249
A250-41-42-43 : 4/18/2024 14:54:02
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ -------- -------
astra_ci_vc_esxi_24_75_data 510 8 464 35 142995 13414206 247
A250-41-42-43 : 4/18/2024 14:55:02
astra_ci_vc_esxi_24_75_data 539 8 486 38 158387 14309751 196
A250-41-42-43 : 4/18/2024 14:56:01
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ -------- -------
astra_ci_vc_esxi_24_75_data 979 8 910 41 144948 40021916 207
A250-41-42-43 : 4/18/2024 14:57:01
astra_ci_vc_esxi_24_75_data 464 7 419 33 142660 10789548 234
A250-41-42-43 : 4/18/2024 14:58:00
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ -------- -------
astra_ci_vc_esxi_24_75_data 534 11 478 41 180414 14013031 245
A250-41-42-43 : 4/18/2024 14:58:59
astra_ci_vc_esxi_24_75_data 610 10 552 37 166538 15873727 231
A250-41-42-43 : 4/18/2024 14:59:59
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ ------- -------
astra_ci_vc_esxi_24_75_data 450 10 398 40 175906 8200806 223
A250-41-42-43 : 4/18/2024 15:00:58
astra_ci_vc_esxi_24_75_data 962 8 916 35 159171 42578977 212
Harvest:
SM:
As you can see above, SM shows a spike of around 2k, while both ONTAP and Harvest show spikes around 1k.
Hi @rahulguptajss, apologies for the delay. Regarding the protocol-level queries: with your queries the values are more or less the same. Can you explain why SM shows such higher spikes than both ONTAP itself and Harvest?
@BrendonA667, what I meant is that Harvest provides both SVM total operations via the metric svm_vol_total_ops and protocol-level SVM operations. However, SM is only showing protocol-level SVM operations.
My point is that we should compare these metrics with the ONTAP CLI to check which one matches. In one of my tests, SVM NFS ops (svm_nfs_ops) had different values in SM and Harvest, but Harvest's values matched the ONTAP CLI.
To check these, can you compare the following?
- Compare Harvest's SVM total IOPS (svm_vol_total_ops) against the ONTAP CLI with the command below, which displays total ops for an SVM every 1 minute, similar to Harvest's 1-minute schedule. A small variation is expected, as Harvest's schedule may not be perfectly in sync with this CLI. (A PromQL sketch of the Harvest-side queries follows after these steps.)
statistics vserver show -interval 60 -iterations 50 -vserver VSERVER
- As your screenshot above shows, NFS ops for the SVM match in both Harvest and SM, which is good. However, there are differences in CIFS ops. Also, please import the SVM dashboard from here, which includes the CIFS ops fix for Harvest.
Let's compare this with the ONTAP CLI to check whether Harvest's SVM CIFS ops match. As mentioned earlier, SM uses a different API to collect its data, so we want to verify whether Harvest's data matches the ONTAP CLI.
Enter diag mode:
set d
Then run the following command:
statistics show-periodic -object cifs:vserver -interval 60 -iterations 50 -instance VSERVER -counter cifs_ops
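For the Harvest side of both comparisons, a minimal sketch of what to graph over the same window as the CLI iterations (metric names as used earlier in this thread; VSERVER is a placeholder for the SVM name):

# Compare with 'statistics vserver show' Total Ops
svm_vol_total_ops{svm="VSERVER"}

# Compare with 'statistics show-periodic -object cifs:vserver ... -counter cifs_ops'
sum(svm_cifs_op_count{svm="VSERVER"}) by (datacenter, cluster, svm)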
@BrendonA667 Were you able to compare the data using the ONTAP CLI as suggested here?
Verified on 24.05.0 commit 6617960b using 10.195.15.41