
Incompatible with driver version 460.32.03 (because of the two dots)

Open aLeX1443 opened this issue 3 years ago • 10 comments

I believe the current version is not compatible with Nvidia driver version 460.32.03, due to it having two dots in the name.

Please see the end of the line below:

2021-01-13T20:22:02.494Z	WARN	elasticsearch/client.go:535	Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbff7f37a9d203e60, ext:32072702120, loc:(*time.Location)(0x2090e20)}, Meta:common.MapStr(nil), Fields:common.MapStr{"agent":common.MapStr{"ephemeral_id":"fd12b93b-9db1-4e24-9e0d-747229464c00", "hostname":"dca939cfb9c2", "id":"33c198bd-3989-4a66-9683-fe258efbe53b", "type":"nvidiagpubeat", "version":"7.3.3"}, "clocks":common.MapStr{"gr":0, "mem":405, "sm":0}, "count":2, "driver_version":"460.32.03", "ecs":common.MapStr{"version":"1.0.1"}, "fan":common.MapStr{"speed":30}, "gpuIndex":1, "host":common.MapStr{"name":"dca939cfb9c2"}, "index":1, "memory":common.MapStr{"total":24268, "used":5942}, "name":"GeForceRTX3090", "power":common.MapStr{"draw":7.14, "limit":350}, "pstate":8, "temperature":common.MapStr{"gpu":28}, "type":"nvidiagpubeat", "utilization":common.MapStr{"gpu":0, "memory":0}}, Private:interface {}(nil), TimeSeries:false}, Flags:0x0} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [driver_version] of type [float] in document with id 'tvFp_XYBFpUHIuKQj2b6'. Preview of field's value: '460.32.03'","caused_by":{"type":"number_format_exception","reason":"multiple points"}}
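
The `number_format_exception` ("multiple points") at the end of that log line can be reproduced outside Elasticsearch. Below is a minimal Python sketch, assuming only that a field mapped as `float` must parse as a single decimal number. The helper name `parses_as_float` is illustrative and is not part of nvidiagpubeat or Elasticsearch.

```python
def parses_as_float(value: str) -> bool:
    """Return True if value is a valid single decimal number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# A two-component value looks numeric; a three-component driver version does not.
for candidate in ["460.32", "460.32.03", "418.87.00"]:
    print(candidate, parses_as_float(candidate))
```

So a version string with two dots can only be stored in a field that is mapped as a string type, not as a number.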

aLeX1443 avatar Jan 13 '21 20:01 aLeX1443

A temporary workaround seems to be to remove the driver_version from the configuration file, i.e.,

nvidiagpubeat:
  period: 1s
  query: "--query-gpu=name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  env: "dev"

output.elasticsearch:
  hosts: "${ELASTICSEARCH_HOSTS}"
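
The workaround amounts to dropping one entry from the comma-separated field list in `query`. A small Python sketch of that edit, in case the query string is generated rather than written by hand (the function name `drop_query_field` is hypothetical and the query is shortened for readability):

```python
def drop_query_field(query: str, field: str) -> str:
    """Remove one field from a '--query-gpu=a,b,c' style query string."""
    prefix, _, fields = query.partition("=")
    kept = [name for name in fields.split(",") if name != field]
    return f"{prefix}={','.join(kept)}"


query = "--query-gpu=driver_version,name,count,index"
print(drop_query_field(query, "driver_version"))
# prints: --query-gpu=name,count,index
```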

aLeX1443 avatar Jan 13 '21 20:01 aLeX1443

Hello Alex, thank you for raising the issue along with a workaround.

Please share the output of the query below, which includes driver_version, on your NVIDIA GPU.

nvidia-smi --query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate

This way, I can re-create the issue with nvidiagpubeat/nvidiasmilocal/localnvidiasmi.go and provide a fix.

Cheers Deepak

deepujain avatar Jan 16 '21 18:01 deepujain

Hello Alex, @aLeX1443

I am unable to re-create the issue.

nvidiagpubeat.yml has driver_version and looks like this:

nvidiagpubeat:
  # Defines how often an event is sent to the output
  period: 1s
  # By default the query of type query-gpu is executed to support backward compatibility
  # query: "name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  # A generic version of query is supported by nvidiagpubeat for query options like --query-gpu,--query-compute-apps and others.
  # -query-gpu will provide information about GPU.
  query: "--query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  # --query-compute-apps will list currently active compute processes.
  # query: "--query-compute-apps=gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory,used_memory"
  env: "test"
  # env can be test or production. test is for test purposes to evaluate functionality of this beat. Switch to production

The output of above query on my real GPU is

nvidia-smi --query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate --format=csv
name, pci.bus_id, serial, uuid, driver_version, count, index, fan.speed [%], memory.total [MiB], memory.used [MiB], utilization.gpu [%], utilization.memory [%], temperature.gpu, power.draw [W], power.limit [W], clocks.current.graphics [MHz], clocks.current.sm [MHz], clocks.current.memory [MHz], pstate

Tesla P100-PCIE-16GB, 00000000:08:00.0, 1234567890123, GPU-xxx75xxx-xxxx-xxx-xxxx-1234567890ab, 418.87.00, 1, 0, [Not Supported], 16280 MiB, 0 MiB, 0 %, 0 %, 28, 26.02 W, 250.00 W, 405 MHz, 405 MHz, 715 MHz, P0

Here the driver_version also has two dots: 418.87.00.

The resulting output from nvidiagpubeat is

2021-01-16T21:02:27.389-0800	DEBUG	[publish]	pipeline/processor.go:308	Publish event: {
  "@timestamp": "2021-01-17T05:02:27.388Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "driver_version": "418.87.00",
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "temperature": {
    "gpu": 28
  },
  "pstate": 0,
  "power": {
    "draw": 26.02,
    "limit": 250
  },
  "gpu_serial": 1234567890123,
  "name": "Tesla100-PCIE-16GB",
  "utilization": {
    "gpu": 0,
    "memory": 0
  },
  "index": 0,
  "fan": {
    "speed": "[NotSupported]"
  },
  "gpu_uuid": "GPU-xxx75xxx-xxxx-xxx-xxxx-1234567890ab",
  "host": {
    "name": "AA-ABC-11111111"
  },
  "gpu_bus_id": "00000000:08:00.0",
  "gpuIndex": 0,
  "memory": {
    "total": 16280,
    "used": 0
  },
  "count": 1,
  "clocks": {
    "sm": 405,
    "mem": 715,
    "gr": 405
  },
  "type": "nvidiagpubeat"
}

Each field maps correctly to the CSV output of the nvidia-smi command.
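
To make that mapping concrete, here is a rough Python sketch of how one CSV row could be turned into the nested event shown above. It is not the beat's actual Go implementation: the function names `convert` and `to_event` and the coercion rules (strip spaces and units, turn "P0" into 0, fall back to a string) are assumptions inferred from the sample output, and the field list and row are abridged from the real ones.

```python
def convert(field, raw):
    """Strip spaces and units, then coerce to int or float where possible."""
    text = raw.strip()
    # Values such as "16280 MiB" keep only the leading number; other values
    # (names, "[Not Supported]") just lose their inner spaces.
    token = text.split(" ")[0] if text[:1].isdigit() else text.replace(" ", "")
    if field == "pstate" and token[:1] == "P" and token[1:].isdigit():
        return int(token[1:])
    try:
        number = float(token)
    except ValueError:
        return token  # e.g. "418.87.00" is not a number, so it stays a string
    return int(number) if number.is_integer() else number


def to_event(fields, row):
    """Build a nested dict: 'power.draw' becomes event['power']['draw']."""
    event = {}
    for field, raw in zip(fields, row.split(",")):
        parent, _, child = field.partition(".")
        value = convert(field, raw)
        if child:
            event.setdefault(parent, {})[child] = value
        else:
            event[parent] = value
    return event


fields = "name,driver_version,fan.speed,memory.total,power.draw,power.limit,pstate".split(",")
row = "Tesla P100-PCIE-16GB, 418.87.00, [Not Supported], 16280 MiB, 26.02 W, 250.00 W, P0"
print(to_event(fields, row))
```

Note that under this assumed coercion rule a two-component value such as "450.66" would come out as a number rather than a string, while "418.87.00" stays a string.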

Cheers Deepak

deepujain avatar Jan 17 '21 05:01 deepujain

Hi @deepujain, here is the output of: nvidia-smi --query-gpu=driver_version,name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate --format=csv

460.32.03, GeForce RTX 3090, 2, 0, 30 %, 24265 MiB, 10275 MiB, 0 %, 10 %, 28, 10.15 W, 350.00 W, 210 MHz, 210 MHz, 405 MHz, P8
460.32.03, GeForce RTX 3090, 2, 1, 30 %, 24268 MiB, 2846 MiB, 0 %, 0 %, 25, 8.39 W, 350.00 W, 0 MHz, 0 MHz, 405 MHz, P8

The driver version is the one installed by the Ubuntu Additional Drivers application. Would it be possible to test with the same driver version, i.e., 460.32.03?

aLeX1443 avatar Jan 17 '21 15:01 aLeX1443

@aLeX1443 I took the output that you shared and fed it into nvidiagpubeat/nvidiasmilocal/localnvidiasmi.go. I ran nvidiagpubeat (master branch) in local mode and the events are published correctly.

Publish event: {
  "@timestamp": "2021-01-17T15:34:57.213Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "index": 0,
  "utilization": {
    "gpu": 0,
    "memory": 10
  },
  "temperature": {
    "gpu": 28
  },
  "host": {
    "name": "AA-ABC-11111111"
  },
  "gpuIndex": 0,
  "power": {
    "draw": 10.15,
    "limit": 350
  },
  "pstate": 8,
  "clocks": {
    "gr": 210,
    "sm": 210,
    "mem": 405
  },
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "driver_version": "460.32.03",
  "type": "nvidiagpubeat",
  "name": "GeForceRTX3090",
  "count": 2,
  "memory": {
    "total": 24265,
    "used": 10275
  },
  "fan": {
    "speed": 30
  }
}
2021-01-17T07:34:57.213-0800	DEBUG	[publish]	pipeline/processor.go:308	Publish event: {
  "@timestamp": "2021-01-17T15:34:57.213Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "power": {
    "draw": 8.39,
    "limit": 350
  },
  "gpuIndex": 1,
  "driver_version": "460.32.03",
  "name": "GeForceRTX3090",
  "utilization": {
    "gpu": 0,
    "memory": 0
  },
  "type": "nvidiagpubeat",
  "count": 2,
  "pstate": 8,
  "fan": {
    "speed": 30
  },
  "index": 1,
  "temperature": {
    "gpu": 25
  },
  "host": {
    "name": "AA-ABC-11111111"
  },
  "clocks": {
    "gr": 0,
    "sm": 0,
    "mem": 405
  },
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "memory": {
    "total": 24268,
    "used": 2846
  }
}

What error do you see with nvidiagpubeat? Which branch of nvidiagpubeat are you using (master or withBeats7.3)?

deepujain avatar Jan 17 '21 15:01 deepujain

I do not have the flexibility to modify the driver version of GPUs on the cluster.

deepujain avatar Jan 17 '21 15:01 deepujain

Regarding the output that you shared in the issue description (https://github.com/eBay/nvidiagpubeat/issues/32#issue-785417368): I think nvidiagpubeat is able to handle a driver_version with multiple points and create the event for ES to consume. However, it appears that ES was not able to ingest it.

I see this in the event that you shared. "driver_version":"460.32.03"
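
If the rejection really comes from the index mapping, one possible remedy on the Elasticsearch side is to map driver_version as a string type instead of float. The Python sketch below only builds the JSON body of a legacy index template for ES 7.x. The index pattern `nvidiagpubeat-*` and the helper name `keyword_template` are assumptions, and the beat's own template setup may override a template created by hand.

```python
import json


def keyword_template(index_pattern, field):
    """Build a legacy index template body that maps `field` as a keyword."""
    return {
        "index_patterns": [index_pattern],  # assumed pattern, adjust to the real index name
        "mappings": {"properties": {field: {"type": "keyword"}}},
    }


body = keyword_template("nvidiagpubeat-*", "driver_version")
print(json.dumps(body, indent=2))
```

The mapping of a field in an existing index cannot be changed in place, so a template like this would only take effect for indices created after it is installed.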

deepujain avatar Jan 17 '21 15:01 deepujain

I was using branch withBeats7.3. I'll test it with master once I get the chance.

aLeX1443 avatar Jan 17 '21 16:01 aLeX1443

I found that it works with branch withBeats7.3 as well:

2021-01-17T09:07:00.947-0800	INFO	nvidia/gpu.go:68	Running command localnvidiasmi for query:  --query-gpu=driver_version,name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate  with gpuCount 4
2021-01-17T09:07:01.099-0800	DEBUG	[nvidiagpubeat]	beater/nvidiagpubeat.go:77	Event generated, Attempting to publish to configured output.
2021-01-17T09:07:01.099-0800	DEBUG	[processors]	processing/processors.go:183	Publish event: {
  "@timestamp": "2021-01-17T17:07:01.100Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "_doc",
    "version": "7.3.3"
  },
  "pstate": 8,
  "host": {
    "name": "AA-ABC-11111111"
  },
  "count": 2,
  "memory": {
    "used": 2846,
    "total": 24268
  },
  "power": {
    "draw": 8.39,
    "limit": 350
  },
  "clocks": {
    "mem": 405,
    "gr": 0,
    "sm": 0
  },
  "driver_version": "460.32.03",
  "agent": {
    "ephemeral_id": "071268ff-b8b3-44c4-bbbd-b378a2d26707",
    "hostname": "AA-ABC-11111111",
    "id": "9ebd65ba-4f83-4772-b361-98415432dee4",
    "version": "7.3.3",
    "type": "nvidiagpubeat"
  },
  "name": "GeForceRTX3090",
  "index": 1,
  "fan": {
    "speed": 30
  },
  "ecs": {
    "version": "1.0.1"
  },
  "temperature": {
    "gpu": 25
  },
  "gpuIndex": 1,
  "type": "nvidiagpubeat",
  "utilization": {
    "memory": 0,
    "gpu": 0
  }
}

^C2021-01-17T09:07:01.601-0800 DEBUG [service] service/service.go:53 Received sigterm/sigint, stopping
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:149 client: closing acker
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":212,"time":{"ms":212}},"total":{"ticks":242,"time":{"ms":242},"value":242},"user":{"ticks":30,"time":{"ms":30}}},"info":{"ephemeral_id":"071268ff-b8b3-44c4-bbbd-b378a2d26707","uptime":{"ms":1685}},"memstats":{"gc_next":4194304,"memory_alloc":2049392,"memory_total":3658104,"rss":15740928},"runtime":{"goroutines":8}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"elasticsearch"},"pipeline":{"clients":0,"events":{"active":2,"published":2,"total":2}}},"system":{"cpu":{"cores":8},"load":{"1":7.4155,"15":3.1431,"5":4.5801,"norm":{"1":0.9269,"15":0.3929,"5":0.5725}}}}}}
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:154 Uptime: 1.689151734s
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:131 Stopping metrics logging.
2021-01-17T09:07:01.606-0800 INFO instance/beat.go:432 nvidiagpubeat stopped.


I will wait for results from your testing with branch `withBeats7.3`

deepujain avatar Jan 17 '21 17:01 deepujain

@aLeX1443 Did you get a chance to look into it?

deepujain avatar Jan 13 '22 15:01 deepujain