
Can nvidiagpubeat be made to also export the process running on each card?

Open · musiczhzhao opened this issue 3 years ago · 10 comments

Since nvidiagpubeat is based on nvidia-smi, and nvidia-smi is able to list the processes that are currently using the GPU cards, in theory nvidiagpubeat should be able to export the process info as metrics. Please correct me if I am wrong.
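For reference, a raw query along these lines lists the active compute processes (the field names come from nvidia-smi --help-query-compute-apps; treat this invocation as illustrative):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv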

I am interested to know whether there is any plan to do this. It would be very helpful for identifying the GPU resource usage of processes and the efficiency of the code.

All the best.

musiczhzhao avatar Nov 10 '20 18:11 musiczhzhao

@musiczhzhao Yes, it can. I had a piece of code for it. I will try to integrate it into nvidiagpubeat.

deepujain avatar Nov 10 '20 19:11 deepujain

@deepujain Thank you! 👍

musiczhzhao avatar Nov 12 '20 06:11 musiczhzhao

Hi @deepujain, how are things going? Just checking whether there is any update, and whether any help is needed. Best

musiczhzhao avatar Dec 11 '20 16:12 musiczhzhao

The changes are ready. However, I lost access to my GPU cluster, so testing the changes has become a challenge and created a dependency on external testing. Here is a sample.

With --query-gpu, nvidiagpubeat will generate the event below.

Publish event: {
  "@timestamp": "2021-01-03T07:27:16.080Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "type": "nvidiagpubeat",
  "gpu_uuid": "GPU-b884db58-6340-7969-a79f-b937f3583884",
  "driver_version": "418.87.01",
  "index": 3,
  "gpu_serial": 3.20218176911e+11,
  "memory": {
    "used": 3256,
    "total": 16280
  },
  "name": "Tesla100-PCIE-16GB",
  "host": {
    "name": "AB-SJC-11111111"
  },
  "utilization": {
    "memory": 50,
    "gpu": 50
  },
  "beat": {
    "name": "AB-SJC-11111111",
    "hostname": "AB-SJC-11111111",
    "version": "6.5.5"
  },
  "pstate": 0,
  "gpu_bus_id": "00000000:19:00.0",
  "count": 4,
  "fan": {
    "speed": "[NotSupported]"
  },
  "gpuIndex": 3,
  "power": {
    "draw": 25.28,
    "limit": 250
  },
  "temperature": {
    "gpu": 24
  },
  "clocks": {
    "gr": 405,
    "sm": 405,
    "mem": 715
  }
}
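For reference, the raw nvidia-smi call behind this query would look roughly as follows (the --format flags here are illustrative; they are not necessarily what nvidiagpubeat passes internally):

nvidia-smi --query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate --format=csv,noheader,nounits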

With --query-compute-apps, nvidiagpubeat will generate the event below.

Publish event: {
  "@timestamp": "2021-01-03T07:29:53.633Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "pid": 222414,
  "process_name": "python",
  "used_gpu_memory": 10,
  "gpu_bus_id": "00000000:19:00.0",
  "gpu_serial": 3.20218176911e+11,
  "beat": {
    "name": "AB-SJC-11111111",
    "hostname": "AB-SJC-11111111",
    "version": "6.5.5"
  },
  "gpu_name": "Tesla100-PCIE-16GB",
  "used_memory": 15,
  "gpuIndex": 3,
  "type": "nvidiagpubeat",
  "gpu_uuid": "GPU-b884db58-6340-7969-a79f-b937f3583884",
  "host": {
    "name": "LM-SJC-11004865"
  }
}

deepujain avatar Jan 03 '21 06:01 deepujain

@musiczhzhao I made changes to nvidiagpubeat to support process-detail information, and made it generic in the process. Please test and share the results here (including a few sample events) for --query-compute-apps (active GPU process details).

Since it is now generic, it can support all types of queries. I have tested only --query-gpu and --query-compute-apps. If you plan to use other options, let me know and you can help me with testing.

nvidia-smi -h

  SELECTIVE QUERY OPTIONS:

    Allows the caller to pass an explicit list of properties to query.

    [one of]

    --query-gpu=                Information about GPU.
                                Call --help-query-gpu for more info.
    --query-supported-clocks=   List of supported clocks.
                                Call --help-query-supported-clocks for more info.
    --query-compute-apps=       List of currently active compute processes.
                                Call --help-query-compute-apps for more info.
    --query-accounted-apps=     List of accounted compute processes.
                                Call --help-query-accounted-apps for more info.
    --query-retired-pages=      List of device memory pages that have been retired.
                                Call --help-query-retired-pages for more info.

https://github.com/eBay/nvidiagpubeat#sample-event has details.

deepujain avatar Jan 03 '21 09:01 deepujain

@musiczhzhao

deepujain avatar Jan 06 '21 05:01 deepujain

Hi @deepujain, Thank you! I will test it and get back to you ASAP. 👍

Best

musiczhzhao avatar Jan 06 '21 07:01 musiczhzhao

Hello @deepujain,

Happy weekend!

I have briefly tested the new version and can confirm that it exports the application name and the GPU memory usage of the application when --query-compute-apps is used.

One question I have is whether there is a way to enable both --query-gpu and --query-compute-apps so that both kinds of documents can be exported. I tried to enable both in the configuration file, and it turned out only the latter one took effect.

For example, with the following in the configuration, it seems to export only the compute-app metrics:


## --query-gpu will provide information about GPU.
query: "--query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
## --query-compute-apps will list currently active compute processes.
query: "--query-compute-apps=gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory,used_memory"


Another question: we find it useful to have the full command line of the app. For example, if a Python script is launched with python, nvidia-smi currently just shows the app as python, without the actual script name and arguments. Searching around online, we found that what people generally do is first get the pid of the application and then get the full command from the ps command (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username). Can we have this built in, so events include the cmd just as metricbeat does?
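For illustration, the manual lookup from that thread boils down to something like the following, using the pid from the sample event above:

ps -o args= -p 222414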

Best, Zhao

musiczhzhao avatar Jan 16 '21 01:01 musiczhzhao

Hello Zhao,

Thank you for testing it out. Please share sample events for both queries, --query-compute-apps and --query-gpu. It will help me update the documentation with real events. I can then close this issue, as the current code seems to have met the expectation of issue #29.

Could you please raise separate GitHub issues for each new feature request:

  1. A way to enable both --query-gpu and --query-compute-apps so that both kinds of documents can be exported (you reported that enabling both in the configuration file makes only the latter take effect). Please share expected sample events of a combined query "--query-compute-apps-and--query-gpu".

  2. An enriched version of --query-compute-apps that adds further process details: first get the pid of the application, then get the full command line from the ps command (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username).

Cheers Deepak

deepujain avatar Jan 16 '21 18:01 deepujain

Hi @deepujain,

I did a bit more testing which took some time.

Another issue we found is that the new version seems to assume there is only one app running on each GPU card, i.e. that nvidia-smi returns exactly 4 processes when there are 4 GPU cards in a machine. Otherwise it crashes with the following error message.

2021-01-26T12:00:20.226-0600 INFO runtime/panic.go:975 nvidiagpubeat stopped.
2021-01-26T12:00:20.259-0600 FATAL [nvidiagpubeat] instance/beat.go:154 Failed due to panic. {"panic": "runtime error: index out of range [4] with length 4", "stack":
  github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.Run.func1.1
    /nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance/beat.go:155
  runtime.gopanic
    /s0/Compilers/go/go1.14.6/src/runtime/panic.go:969
  runtime.goPanicIndex
    /s0/Compilers/go/go1.14.6/src/runtime/panic.go:88
  github.com/ebay/nvidiagpubeat/nvidia.Utilization.run
    /nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/gpu.go:122
  github.com/ebay/nvidiagpubeat/nvidia.Metrics.Get
    /nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/metrics.go:52
  github.com/ebay/nvidiagpubeat/beater.(*Nvidiagpubeat).Run
    /nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/beater/nvidiagpubeat.go:73
  github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance. ...

The code allocating the events is at line 71 of nvidia/gpu.go:

events := make([]common.MapStr, gpuCount, 2*gpuCount)
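A minimal sketch of one possible fix, assuming the slice can instead be sized by the number of lines nvidia-smi actually emitted (buildEvents and parseLine are hypothetical names, not the actual gpu.go code):

package nvidia

import (
	"strings"

	"github.com/elastic/beats/libbeat/common"
)

// buildEvents sizes the events slice by the number of lines nvidia-smi
// actually emitted, rather than pre-sizing it to gpuCount, so a GPU
// running several compute apps no longer overflows the slice.
// parseLine stands in for the existing per-line parsing logic.
func buildEvents(rawOutput string, parseLine func(string) common.MapStr) []common.MapStr {
	trimmed := strings.TrimSpace(rawOutput)
	if trimmed == "" {
		return nil // no active compute apps
	}
	lines := strings.Split(trimmed, "\n")
	events := make([]common.MapStr, 0, len(lines))
	for _, line := range lines {
		events = append(events, parseLine(line))
	}
	return events
}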

I will attach the sample events in a separate post.

Best, Zhao

musiczhzhao avatar Jan 27 '21 21:01 musiczhzhao