Enable metric tags using volume comment for zapiperf objects.

Open chadpruden opened this issue 4 years ago • 12 comments

Is your feature request related to a problem? Please describe. I recently implemented metric tagging using the volume comment (for capacity metrics); however, I am unable to do the same for performance metrics. I need to display perf metrics leveraging the K:V pairs within volume comments.

Describe the solution you'd like Either a similar implementation (in the Harvest2 poller configuration) to what is currently possible for capacity metrics, or an alternative solution (see below).

Describe alternatives you've considered Chris Grindstaff raised the idea of a custom plugin (volume-tagger). If this idea can be proven out, I would like documentation on creating such a custom plugin.

Additional context

https://netapppub.slack.com/archives/C02072M1UCD/p1639154984401500?thread_ts=1638986222.380400&cid=C02072M1UCD

chadpruden avatar Dec 20 '21 20:12 chadpruden

Curious if there are any updates for this?

chadpruden avatar Sep 02 '22 18:09 chadpruden

@chadpruden We'll discuss this for our next release and update here. Thanks for the follow up.

rahulguptajss avatar Sep 06 '22 09:09 rahulguptajss

@chadpruden I have found a way to merge comment information into ZapiPerf counters, and it doesn't require any plugin. Below are the steps.

1: Modify the volume.yaml Zapi template as below. This adds the comment counter and exports both comment and instance_uuid from the existing template.

name:                     Volume
query:                    volume-get-iter
object:                   volume

# increase client timeout for volumes
client_timeout:           2m

counters:
  volume-attributes:
    - volume-autosize-attributes:
      - maximum-size
      - grow-threshold-percent

    - volume-id-attributes:
      - ^^instance-uuid             => instance_uuid
      - ^name                       => volume
      - ^node                       => node
      - ^owning-vserver-name        => svm
      - ^containing-aggregate-name  => aggr
      - ^containing-aggregate-uuid  => aggrUuid
      - ^style-extended             => style
      - ^type                       => type
      - ^comment                    => comment

    - volume-inode-attributes:
      - files-used
      - files-total

    - volume-sis-attributes:
      - compression-space-saved               => sis_compress_saved
      - deduplication-space-saved             => sis_dedup_saved
      - total-space-saved                     => sis_total_saved
      - percentage-compression-space-saved    => sis_compress_saved_percent
      - percentage-deduplication-space-saved  => sis_dedup_saved_percent
      - percentage-total-space-saved          => sis_total_saved_percent
      - ^is-sis-volume                        => is_sis_volume

    - volume-space-attributes:
      - expected-available
      - filesystem-size                       => filesystem_size
      - logical-available
      - logical-used
      - logical-used-by-afs
      - logical-used-by-snapshots
      - logical-used-percent
      - physical-used
      - physical-used-percent
      - size                                => size
      - size-available                      => size_available
      - size-total                          => size_total
      - size-used                           => size_used
      - percentage-size-used                => size_used_percent
      - size-used-by-snapshots              => snapshots_size_used
      - size-available-for-snapshots        => snapshots_size_available
      - snapshot-reserve-available          => snapshot_reserve_available
      - snapshot-reserve-size               => snapshot_reserve_size
      - percentage-snapshot-reserve         => snapshot_reserve_percent
      - percentage-snapshot-reserve-used    => snapshot_reserve_used_percent

    - volume-state-attributes:
      - ^state
      - ^status

    - volume-snapshot-attributes:
      - ^auto-snapshots-enabled             => auto_snapshots_enabled
      - ^snapshot-policy
      - snapshot-count
    - ^encrypt                              => isEncrypted

plugins:
  Volume:
    schedule:
      - data: 900s  # should be multiple of data poll duration
    #batch_size: "50"
  LabelAgent:
    # metric label zapi_value rest_value `default_value`
    value_to_num:
      - new_status state online online `0`
    exclude_equals:
      - style `flexgroup_constituent`
    # To prevent visibility of transient volumes, uncomment the following lines
#    exclude_regex:
#      # Exclude SnapProtect/CommVault Intellisnap, Clone volumes have a “_CVclone” suffix
#      - volume `.+_CVclone`
#      # Exclude SnapCenter, Clone volumes have a “DDMMYYhhmmss” suffix
#      - volume `.+(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])\d\d[0-9]{6}`
#      # Exclude manually created SnapCreator clones, Clone volumes have a “cl_” prefix and a “_YYYYMMDDhhmmss” suffix
#      - volume `cl_.+_(19|20)\d\d(0[1-9]|1[012])( 0[1-9]|[12][0-9]|3[01])[0-9]{6}`
#      # Exclude SnapDrive/SnapManager, Clone volumes have a “sdw_cl_” prefix
#      - volume `sdw_cl_.+`
#      # Exclude Metadata volumes, CRS volumes in SVM-DR or MetroCluster have a “MDV_CRS_” prefix
#      - volume `MDV_CRS_.+`
#      # Exclude Metadata volumes, Audit volumes have a “MDV_aud_” prefix
#      - volume `MDV_aud_.+`
    replace:
      - style style `flexgroup_constituent` `flexgroup`
  Aggregator:
    - volume<style=flexgroup>volume node,svm,aggr,style

export_options:
  instance_keys:
    - volume
    - node
    - svm
    - aggr
    - style
  instance_labels:
    - state
    - is_sis_volume
    - snapshot_policy
    - type
    - protectedByStatus
    - protectedBy
    - protectionRole
    - all_sm_healthy
    - isEncrypted
    - isHardwareEncrypted
    - comment
    - instance_uuid

2: Modify the volume.yaml ZapiPerf template as below. This adds instance_uuid to the exports.


name:                     Volume
query:                    volume
object:                   volume

instance_key:             uuid

counters:
  - instance_uuid
  - instance_name         => volume
  - vserver_name          => svm
  - node_name             => node
  - parent_aggr           => aggr
  - read_data
  - write_data
  - read_ops
  - write_ops
  - other_ops
  - total_ops
  - read_latency
  - write_latency
  - other_latency
  - avg_latency

plugins:
  - Volume
#  - LabelAgent:
#      # To prevent visibility of transient volumes, uncomment the following lines
#      exclude_regex:
#        # Exclude SnapProtect/CommVault Intellisnap, Clone volumes have a “_CVclone” suffix
#        - volume `.+_CVclone`
#        # Exclude SnapCenter, Clone volumes have a “DDMMYYhhmmss” suffix
#        - volume `.+(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])\d\d[0-9]{6}`
#        # Exclude manually created SnapCreator clones, Clone volumes have a “cl_” prefix and a “_YYYYMMDDhhmmss” suffix
#        - volume `cl_.+_(19|20)\d\d(0[1-9]|1[012])( 0[1-9]|[12][0-9]|3[01])[0-9]{6}`
#        # Exclude SnapDrive/SnapManager, Clone volumes have a “sdw_cl_” prefix
#        - volume `sdw_cl_.+`
#        # Exclude Metadata volumes, CRS volumes in SVM-DR or MetroCluster have a “MDV_CRS_” prefix
#        - volume `MDV_CRS_.+`
#        # Exclude Metadata volumes, Audit volumes have a “MDV_aud_” prefix
#        - volume `MDV_aud_.+`

export_options:
  instance_keys:
    - volume
    - node
    - svm
    - aggr
    - style
    - instance_uuid

3: Restart the Harvest pollers and, at least 5 minutes after they start, run the query below in Prometheus. It merges the comment data from volume_labels into the volume_read_data ZapiPerf metric. You'll need to do the same for the other volume metrics.

volume_read_data * on(instance_uuid) group_left(comment) volume_labels
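The same join pattern applies to each of the other exported perf metrics, for example (metric names as produced by the template above):

```promql
volume_write_data   * on(instance_uuid) group_left(comment) volume_labels
volume_read_latency * on(instance_uuid) group_left(comment) volume_labels
```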

Let us know if that works.

rahulguptajss avatar Sep 16 '22 14:09 rahulguptajss

@chadpruden Let us know if the above solution helps.

rahulguptajss avatar Sep 26 '22 06:09 rahulguptajss

@rahulguptajss We are currently working through implementing the solution you outlined above; however, we are stuck on this section because we use InfluxDB rather than Prometheus:

`3: Restart harvest pollers and run below query in prometheus after at least 5 mins of poller start. This will merge comment data from volume_labels to volume_read_data zapiperf metric. You'll need to do the same for other volume metric.

volume_read_data * on(instance_uuid) group_left(comment) volume_labels`

Can you tell us whether there is an InfluxDB equivalent to this query that will achieve the same result, or is this not possible / applicable with Influx?

Also, we are curious how your write-up / recommendations above would differ if we were not using the out-of-the-box *.yaml collectors. For example, we would like to implement your solution using the below as a custom collector object:

name:   Harvest_Custom_Volume
query:  volume
object: harvest_custom_volume

We are hesitant to modify the standard collectors as we rely on them for (many) distributed dashboards, so we would like to sandbox / test this first using custom collectors before we implement this more broadly.

Thanks much for your help and support, Rahul

-Scott

electrocreative avatar Oct 04 '22 21:10 electrocreative

hi Scott - Yes, those changes can be made in separate template files so the out-of-the-box ones are not touched. I've outlined the steps below.

Regarding InfluxDB, what Rahul pasted above is a Prometheus join of volume_read_data and volume_labels. InfluxDB 2 supports joins too, although it's been a while since I've used them. I'll see if I can dig something up.

New templates for Volume_With_Tags

Summary

We're going to create two custom.yaml files, one for the Zapi collector and another for the ZapiPerf collector. Those two custom.yaml files will include the templates Rahul shared above.

  1. Create conf/zapi/custom.yaml
  2. Create conf/zapi/cdot/9.8.0/volume_with_tag.yaml
  3. Create conf/zapiperf/custom.yaml
  4. Create conf/zapiperf/cdot/9.8.0/volume_with_tag.yaml

Details

If you cd to your Harvest install directory, you can copy/paste the following code sections to create the files.
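In a standard Harvest install these version-specific template directories already exist; if you are sandboxing somewhere else, create them first so the redirects below don't fail (a minimal sketch):

```shell
# Create the version-specific template directories if they are missing;
# harmless when they already exist.
mkdir -p conf/zapi/cdot/9.8.0 conf/zapiperf/cdot/9.8.0
```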

Create conf/zapi/custom.yaml

echo '
objects:
  VolWithTag: volume_with_tag.yaml
' > conf/zapi/custom.yaml

Create conf/zapi/cdot/9.8.0/volume_with_tag.yaml

echo '
name:                     Volume
query:                    volume-get-iter
object:                   volwithtag

# increase client timeout for volumes
client_timeout:           2m

counters:
  volume-attributes:
    - volume-autosize-attributes:
        - maximum-size
        - grow-threshold-percent

    - volume-id-attributes:
        - ^^instance-uuid             => instance_uuid
        - ^name                       => volume
        - ^node                       => node
        - ^owning-vserver-name        => svm
        - ^containing-aggregate-name  => aggr
        - ^containing-aggregate-uuid  => aggrUuid
        - ^style-extended             => style
        - ^type                       => type
        - ^comment                    => comment

    - volume-inode-attributes:
        - files-used
        - files-total

    - volume-sis-attributes:
        - compression-space-saved               => sis_compress_saved
        - deduplication-space-saved             => sis_dedup_saved
        - total-space-saved                     => sis_total_saved
        - percentage-compression-space-saved    => sis_compress_saved_percent
        - percentage-deduplication-space-saved  => sis_dedup_saved_percent
        - percentage-total-space-saved          => sis_total_saved_percent
        - ^is-sis-volume                        => is_sis_volume

    - volume-space-attributes:
        - expected-available
        - filesystem-size                       => filesystem_size
        - logical-available
        - logical-used
        - logical-used-by-afs
        - logical-used-by-snapshots
        - logical-used-percent
        - physical-used
        - physical-used-percent
        - size                                => size
        - size-available                      => size_available
        - size-total                          => size_total
        - size-used                           => size_used
        - percentage-size-used                => size_used_percent
        - size-used-by-snapshots              => snapshots_size_used
        - size-available-for-snapshots        => snapshots_size_available
        - snapshot-reserve-available          => snapshot_reserve_available
        - snapshot-reserve-size               => snapshot_reserve_size
        - percentage-snapshot-reserve         => snapshot_reserve_percent
        - percentage-snapshot-reserve-used    => snapshot_reserve_used_percent

    - volume-state-attributes:
        - ^state
        - ^status

    - volume-snapshot-attributes:
        - ^auto-snapshots-enabled             => auto_snapshots_enabled
        - ^snapshot-policy
        - snapshot-count
    - ^encrypt                              => isEncrypted

plugins:
  Volume:
    schedule:
      - data: 900s  # should be multiple of data poll duration
    #batch_size: "50"
  LabelAgent:
    # metric label zapi_value rest_value `default_value`
    value_to_num:
      - new_status state online online `0`
    exclude_equals:
      - style `flexgroup_constituent`
    # To prevent visibility of transient volumes, uncomment the following lines
    #    exclude_regex:
    #      # Exclude SnapProtect/CommVault Intellisnap, Clone volumes have a “_CVclone” suffix
    #      - volume `.+_CVclone`
    #      # Exclude SnapCenter, Clone volumes have a “DDMMYYhhmmss” suffix
    #      - volume `.+(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])\d\d[0-9]{6}`
    #      # Exclude manually created SnapCreator clones, Clone volumes have a “cl_” prefix and a “_YYYYMMDDhhmmss” suffix
    #      - volume `cl_.+_(19|20)\d\d(0[1-9]|1[012])( 0[1-9]|[12][0-9]|3[01])[0-9]{6}`
    #      # Exclude SnapDrive/SnapManager, Clone volumes have a “sdw_cl_” prefix
    #      - volume `sdw_cl_.+`
    #      # Exclude Metadata volumes, CRS volumes in SVM-DR or MetroCluster have a “MDV_CRS_” prefix
    #      - volume `MDV_CRS_.+`
    #      # Exclude Metadata volumes, Audit volumes have a “MDV_aud_” prefix
    #      - volume `MDV_aud_.+`
    replace:
      - style style `flexgroup_constituent` `flexgroup`
  Aggregator:
    - volume<style=flexgroup>volume node,svm,aggr,style

export_options:
  instance_keys:
    - volume
    - node
    - svm
    - aggr
    - style
  instance_labels:
    - state
    - is_sis_volume
    - snapshot_policy
    - type
    - protectedByStatus
    - protectedBy
    - protectionRole
    - all_sm_healthy
    - isEncrypted
    - isHardwareEncrypted
    - comment
    - instance_uuid
' > conf/zapi/cdot/9.8.0/volume_with_tag.yaml

Create conf/zapiperf/custom.yaml

echo '
objects:
  VolWithTag: volume_with_tag.yaml
' > conf/zapiperf/custom.yaml

Create conf/zapiperf/cdot/9.8.0/volume_with_tag.yaml

echo '

name:                     Volume
query:                    volume
object:                   volwithtag

instance_key:             uuid

counters:
  - instance_uuid
  - instance_name         => volume
  - vserver_name          => svm
  - node_name             => node
  - parent_aggr           => aggr
  - read_data
  - write_data
  - read_ops
  - write_ops
  - other_ops
  - total_ops
  - read_latency
  - write_latency
  - other_latency
  - avg_latency

plugins:
  - Volume
#  - LabelAgent:
#      # To prevent visibility of transient volumes, uncomment the following lines
#      exclude_regex:
#        # Exclude SnapProtect/CommVault Intellisnap, Clone volumes have a “_CVclone” suffix
#        - volume `.+_CVclone`
#        # Exclude SnapCenter, Clone volumes have a “DDMMYYhhmmss” suffix
#        - volume `.+(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])\d\d[0-9]{6}`
#        # Exclude manually created SnapCreator clones, Clone volumes have a “cl_” prefix and a “_YYYYMMDDhhmmss” suffix
#        - volume `cl_.+_(19|20)\d\d(0[1-9]|1[012])( 0[1-9]|[12][0-9]|3[01])[0-9]{6}`
#        # Exclude SnapDrive/SnapManager, Clone volumes have a “sdw_cl_” prefix
#        - volume `sdw_cl_.+`
#        # Exclude Metadata volumes, CRS volumes in SVM-DR or MetroCluster have a “MDV_CRS_” prefix
#        - volume `MDV_CRS_.+`
#        # Exclude Metadata volumes, Audit volumes have a “MDV_aud_” prefix
#        - volume `MDV_aud_.+`

export_options:
  instance_keys:
    - volume
    - node
    - svm
    - aggr
    - style
    - instance_uuid
' > conf/zapiperf/cdot/9.8.0/volume_with_tag.yaml

cgrinds avatar Oct 05 '22 12:10 cgrinds

You can test from the command line, like so, to verify everything's working. Change $poller to match your poller.

bin/poller --poller $poller --objects VolWithTag

Logs showing the custom templates are being used

2022-10-05T08:37:40-04:00 INF collector/helpers.go:134 > best-fit template Poller=u2 collector=Zapi:VolWithTag path=conf/zapi/cdot/9.8.0/volume_with_tag.yaml v=9.9.1
2022-10-05T08:37:41-04:00 INF collector/helpers.go:134 > best-fit template Poller=u2 collector=ZapiPerf:VolWithTag path=conf/zapiperf/cdot/9.8.0/volume_with_tag.yaml v=9.9.1

cgrinds avatar Oct 05 '22 12:10 cgrinds

@cgrinds Would this custom collector for volume_with_tags publish to the same measurement name (object: volume) as the out-of-the-box collector, thus double-counting (two parallel collections of) our flexvols into Influx?

chadpruden avatar Oct 05 '22 18:10 chadpruden

@chadpruden good catch, yes it would. I updated the example and changed the object name to volwithtag. Any name will work.

cgrinds avatar Oct 05 '22 18:10 cgrinds

@cgrinds Thank you for all your assistance with this Chris, much appreciated.

We have followed your write-up and believe we are close to having it working, but are stuck on one aspect we are hoping you can shed some light on. The poller sees our custom object ("TAPIVolume") and we see this custom tag being picked up in Grafana, but we are not seeing any metrics flowing, either in the logs or in Grafana.

We noticed the following in the logs and are wondering whether either of these errors indicates a known / common misconfiguration, or if there is something else you recommend we look at to advance our troubleshooting.

{"level":"error","Poller":"fas-xxxxxx","collector":"ZapiPerf:TAPIVolume","stack":[{"func":"New","line":"35","source":"errors.go"},{"func":"(*Client).invoke","line":"366","source":"client.go"},{"func":"(*Client).InvokeBatchWithTimers","line":"285","source":"client.go"},{"func":"(*Client).InvokeBatchRequest","line":"258","source":"client.go"},{"func":"(*ZapiPerf).PollInstance","line":"1169","source":"zapiperf.go"},{"func":"(*task).Run","line":"60","source":"schedule.go"},{"func":"(*AbstractCollector).Start","line":"269","source":"collector.go"},{"func":"goexit","line":"1371","source":"asm_amd64.s"}],"error":"connection error => Post "https://fas-xxxxxx.domain.com:443/servlets/netapp.servlets.admin.XMLrequest_filer": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","caller":"goharvest2/cmd/collectors/zapiperf/zapiperf.go:1170","time":"2022-10-05T15:57:58-05:00","message":"instance request"}

{"level":"info","Poller":"fas-xxxxxx","collector":"ZapiPerf:TAPIVolume","caller":"goharvest2/cmd/poller/collector/collector.go:295","time":"2022-10-05T15:57:58-05:00","message":"no [TAPIVolume] instances on system, entering standby mode"}

Kind regards,

-Scott

electrocreative avatar Oct 06 '22 02:10 electrocreative

@electrocreative There is a timeout error in the logs. The default timeout for the ZapiPerf collector is 10s; you can increase it by adding client_timeout to the ZapiPerf:TAPIVolume template, as mentioned here. Let's try 30s. This value should ideally be less than the polling frequency of the collector, which defaults to 1m for ZapiPerf.
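A minimal sketch of where client_timeout goes in the custom ZapiPerf template (header values taken from the templates above; 30s is the suggested starting point):

```yaml
name:           Volume
query:          volume
object:         volwithtag

instance_key:   uuid

# default is 10s; keep this below the ZapiPerf poll frequency (1m by default)
client_timeout: 30s
```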

rahulguptajss avatar Oct 06 '22 10:10 rahulguptajss

@electrocreative I updated the conf/zapiperf/cdot/9.8.0/volume_with_tag.yaml template shared yesterday with a one-line change at line 10.

Replace `- instance_name` with `- instance_name => volume`.

Not sure if you sorted out the Flux join query or not, but @rahulguptajss and I looked at it today and managed to get it working. Not sure if it exactly fits your case, but sharing in case it helps.

import "join"
import "influxdata/influxdb/schema"

left = from(bucket: "harvest")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "volwithtag")
  |> filter(fn: (r) => r["_field"] == "read_latency")
  |> schema.fieldsAsCols() 

right = from(bucket: "harvest")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "volwithtag")
    |> filter(fn: (r) => r["_field"] == "comment")
    |> schema.fieldsAsCols() 

join.right(
    left: left,
    right: right,
    on: (l, r) => l.instance_uuid == r.instance_uuid,
    as: (l, r) => ({l with comment: r.comment}),
)

cgrinds avatar Oct 06 '22 18:10 cgrinds

Verified in 22.11. @chadpruden Let us know your feedback.

To use the plugin, you need to enable the VolumeTag plugin in the ZapiPerf volume template. Example below:


name:                     Volume
query:                    volume
object:                   volume

instance_key:             uuid

counters:
  - instance_uuid
  - instance_name
  - vserver_name          => svm
  - node_name             => node
  - parent_aggr           => aggr
  - read_data
  - write_data
  - read_ops
  - write_ops
  - other_ops
  - total_ops
  - read_latency
  - write_latency
  - other_latency
  - avg_latency

plugins:
  - Volume
  - VolumeTag
  - Aggregator:
    - node


export_options:
  instance_keys:
    - volume
    - node
    - svm
    - aggr
    - style

rahulguptajss avatar Nov 18 '22 11:11 rahulguptajss