svm_vol_total_ops does not match SVM IOPS in SM
A note for the community
No response
Problem
We see a discrepancy between svm_vol_total_ops and the IOPS shown in System Manager, which we can't explain.
SM:
NABox Dashboard:
Configuration
No response
Poller
Version
23.11.0 and 24.02.0
Poller logs
No response
OS and platform
NABox
ONTAP or StorageGRID version
ONTAP 9.11.1P8
Additional Context
No response
References
No response
SM:
NABox:
Hi @BrendonA667, the two graphs you added above are not directly comparable. SM shows data at the protocol level (protocols/nfs/services/{svm_uuid}), where you can choose the protocol from a dropdown, whereas Harvest shows data at the SVM level (svm_vol_write_ops).
SM screenshot where NFSv3 and iSCSI options exist for the chosen SVM.
If you would like to compare exact values with SM, compare the iSCSI panels and NFS panels for that SVM in the SVM dashboard; the values should be very close.
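For context, a minimal PromQL sketch of what the two views resolve to in Harvest, assuming the metric names used later in this thread and a placeholder SVM_NAME (the actual panel expressions in the SVM dashboard may differ):

# SVM-level view (volume counters), what the NABox dashboard graphs
svm_vol_total_ops{svm="SVM_NAME"}

# Protocol-level view, closer to what SM's per-protocol dropdown shows
svm_nfs_ops{svm="SVM_NAME"}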
Hi @Hardikl, thank you for your feedback. Shouldn't svm_vol_total_ops be even higher than the value shown in SM? I ask because we use QoS at the SVM level and check the SVM dashboard to see whether we hit the limit. In this specific case we saw a difference of almost 5000 IOPS.
Thanks @BrendonA667 for the response.
Could you please provide a few more details so we can better understand this?
- As you shared the SM screenshot earlier, could you share a larger screenshot that includes the protocol name, so we can compare SM values at each protocol level (NFSv3, NFSv4 if applicable, and iSCSI)?
- Also, can you share Harvest screenshots for the same protocols, meaning the iSCSI, NFSv3, and NFSv4 (if applicable) panels, over the same time range for comparison?
Yes, we agree that the svm_vol_total_ops value should be almost equal to or greater than the sum of all protocol-level values. We are also evaluating this, since the source of svm_vol_total_ops is the number of operations per second serviced at the volume level.
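To make that check concrete, one hedged sketch is to graph the SVM-level metric minus the summed protocol-level metrics for the same SVM (metric names as used in this thread, SVM_NAME is a placeholder; add terms for iSCSI or NFSv4 if those protocols carry load). A persistently negative result would contradict the "equal or greater" expectation:

  sum(svm_vol_total_ops{svm="SVM_NAME"}) by (datacenter, cluster, svm)
-
  (
      sum(svm_nfs_ops{svm="SVM_NAME"}) by (datacenter, cluster, svm)
    + sum(svm_cifs_op_count{svm="SVM_NAME"}) by (datacenter, cluster, svm)
  )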
Hi @Hardikl
CIFS SM
CIFS Grafana
NFSv3 SM
NFSv3 Grafana
@BrendonA667 Thank you for providing the details. It appears there is a bug in the Harvest dashboard where not all I/O operations are represented in these panels. As mentioned in the KB article, the total ops for NFS/CIFS are not limited to read, write, and other ops; we also need to account for additional categories (access, getattr, lookup, setattr), which will help us reconcile these values. Harvest does collect this data. Could you try the following queries and compare the results?
Please replace SVM_NAME with the appropriate value.
NFS Protocol Operations Query:
(svm_nfs_read_ops + svm_nfs_write_ops + svm_nfs_access_total + svm_nfs_getattr_total + svm_nfs_lookup_total + svm_nfs_setattr_total{svm="SVM_NAME"})
CIFS Protocol Operations Query:
sum(svm_cifs_op_count{svm="SVM_NAME"}) by (datacenter, cluster, svm)
If these queries yield the expected results, that would explain the discrepancy you are observing with the CIFS protocol above.
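One note on the NFS query above: the svm filter is written only on the last selector. Because PromQL binary operators match series on their label sets, the whole expression should still narrow to that SVM (assuming these metrics share identical labels), but it may be clearer to apply the filter to every selector. A sketch of that variant:

  svm_nfs_read_ops{svm="SVM_NAME"}
+ svm_nfs_write_ops{svm="SVM_NAME"}
+ svm_nfs_access_total{svm="SVM_NAME"}
+ svm_nfs_getattr_total{svm="SVM_NAME"}
+ svm_nfs_lookup_total{svm="SVM_NAME"}
+ svm_nfs_setattr_total{svm="SVM_NAME"}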
As we discussed earlier, SM displays protocol-level data, whereas Harvest presents data both at the protocol level and at a higher level via the metric svm_vol_total_ops. We still need to check whether similar gaps exist for the svm_vol_total_ops metric.
Please confirm whether the protocol-level queries resolve the issue, and we can then proceed with developing a fix.
@BrendonA667 Related to svm_vol_total_ops: if we take an example where only one protocol load is present, say NFSv3, I have encountered an instance during my local testing where SM shows a spike that is approximately twice what Harvest reports. As a next step to validate this, I ran the ONTAP CLI, and its data matches what Harvest reports as svm_vol_total_ops. Could you check the same on your end to see whether that is the case?
Also, svm_vol_total_ops includes all the different categories (Read + Write + Other + Access + Getattr + Lookup + Punch Hole + Setattr), as we discussed earlier.
Below are the commands and data collected over the last 10 minutes, where the polling interval for both Harvest and ONTAP is 1 minute.
Note: the ONTAP clock is 5:30 hours behind.
ONTAP
statistics vserver show -interval 60 -iterations 50 -vserver astra_ci_vc_esxi_24_75_data
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ -------- -------
astra_ci_vc_esxi_24_75_data 506 9 458 35 155170 13170857 213
A250-41-42-43 : 4/18/2024 14:53:03
astra_ci_vc_esxi_24_75_data 590 10 534 43 167850 14494488 249
A250-41-42-43 : 4/18/2024 14:54:02
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ -------- -------
astra_ci_vc_esxi_24_75_data 510 8 464 35 142995 13414206 247
A250-41-42-43 : 4/18/2024 14:55:02
astra_ci_vc_esxi_24_75_data 539 8 486 38 158387 14309751 196
A250-41-42-43 : 4/18/2024 14:56:01
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ -------- -------
astra_ci_vc_esxi_24_75_data 979 8 910 41 144948 40021916 207
A250-41-42-43 : 4/18/2024 14:57:01
astra_ci_vc_esxi_24_75_data 464 7 419 33 142660 10789548 234
A250-41-42-43 : 4/18/2024 14:58:00
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ -------- -------
astra_ci_vc_esxi_24_75_data 534 11 478 41 180414 14013031 245
A250-41-42-43 : 4/18/2024 14:58:59
astra_ci_vc_esxi_24_75_data 610 10 552 37 166538 15873727 231
A250-41-42-43 : 4/18/2024 14:59:59
*Total Read Write Other Read Write Latency
Vserver Ops Ops Ops Ops (Bps) (Bps) (us)
--------------------------- ------ ---- ----- ----- ------ ------- -------
astra_ci_vc_esxi_24_75_data 450 10 398 40 175906 8200806 223
A250-41-42-43 : 4/18/2024 15:00:58
astra_ci_vc_esxi_24_75_data 962 8 916 35 159171 42578977 212
Harvest:
SM:
As you can see above, SM shows a spike of around 2k, while both ONTAP and Harvest show spikes around 1k.
Hi @rahulguptajss, apologies for the delay. Regarding the protocol-level queries: with your queries the values are more or less the same. Can you explain why SM shows such higher spikes than both ONTAP itself and Harvest?
@BrendonA667, what I meant is that Harvest provides both SVM total operations via the metric svm_vol_total_ops and protocol-level SVM operations. However, SM is only showing protocol-level SVM operations.
My point is that we should compare these metrics with the ONTAP CLI to check which one matches. In one of my tests, SVM NFS ops (svm_nfs_ops) had different values in SM and Harvest, but Harvest's values matched the ONTAP CLI.
To check these, can you compare the following?
- Compare Harvest's SVM total IOPS (svm_vol_total_ops) against the ONTAP CLI with the command below, which displays total ops for an SVM every 1 minute, similar to Harvest's 1-minute schedule. A small variation is expected, as Harvest's schedule may not be perfectly in sync with this CLI. (A PromQL sketch of the Harvest-side queries follows after these steps.)
statistics vserver show -interval 60 -iterations 50 -vserver VSERVER
- As your screenshot above shows, NFS ops for the SVM match in both Harvest and SM, which is good. However, there are differences in CIFS ops. Also, please import the SVM dashboard from here, which includes the CIFS ops fix for Harvest.
Let's compare this with the ONTAP CLI to check whether Harvest's SVM CIFS ops match. As mentioned earlier, SM uses a different API to collect its data, so we want to verify whether Harvest's data matches the ONTAP CLI.
Enter diag mode:
set d
Then run the following command:
statistics show-periodic -object cifs:vserver -interval 60 -iterations 50 -instance VSERVER -counter cifs_ops
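For the Harvest side of both comparisons, a minimal sketch of what to graph over the same window as the CLI iterations (metric names as used earlier in this thread; VSERVER is a placeholder for the SVM name):

# Compare with 'statistics vserver show' Total Ops
svm_vol_total_ops{svm="VSERVER"}

# Compare with 'statistics show-periodic -object cifs:vserver ... -counter cifs_ops'
sum(svm_cifs_op_count{svm="VSERVER"}) by (datacenter, cluster, svm)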
@BrendonA667 Were you able to compare the data using the ONTAP CLI as suggested here?
Verified on 24.05.0 commit 6617960b using 10.195.15.41