aws-virtual-gpu-device-plugin issues

Pod keeps restarting when two containers share GPU

I am trying to run Nvidia Triton containers for model inference; however, when more than one container is allocated to the same node, one of the containers 1) either fails to...

parth-chudasama

Update Kubernetes device plugin Link

*Issue #, if available:* No issue *Description of changes:* The old link is no longer working. Point the link to the moved repo address.

Wei-1

How can I monitor GPU metrics virtualized by this plugin using Nvidia DCGM exporter?

I have an AWS EKS cluster with GPU nodes and have installed the AWS virtual GPU device plugin to share GPUs between different pods. It seems that this exporter is dependent on Nvidia device...

jaggerwang

0/8 nodes are available: 8 Insufficient k8s.amazonaws.com/vgpu.

Hello, I get an error like this when I start my pod: 0/8 nodes are available: 8 Insufficient k8s.amazonaws.com/vgpu. I do have 8 nodes, of which two are g4dn.xlarge nodes,...

FanniSun

Bump github.com/gogo/protobuf from 1.3.0 to 1.3.2

Bumps [github.com/gogo/protobuf](https://github.com/gogo/protobuf) from 1.3.0 to 1.3.2. Release notes Sourced from github.com/gogo/protobuf's releases. Release v.1.3.2 Tested versions: go 1.15.6 protoc 3.14.0 Bug fixes: skippy peanut butter Release v1.3.1 Tested versions: go...

dependabot[bot]

dependencies

Are there any plans to support CUDA_MPS_PINNED_DEVICE_MEM_LIMIT?

3

## Present Status I understand the current system configuration as follows: - Currently, the share of GPU threads used by a Pod seems to be controlled by CUDA_MPS_ACTIVE_THREAD_PERCENTAGE. - And the...

t-ibayashi-safie

Autoscaler support

5

GPU sharing works perfectly fine, but when trying to scale pods based on GPU share, cluster-autoscaler is unable to scale instances as required and reports the following errors. ``` clusterautoscaler-aws-cluster-autoscaler-6dbcb4d4f7-fv5w7 aws-cluster-autoscaler...

dempti

GPU Memory errors leads to hanging GPU

2

Hello, when using this plugin, I was able to run `pytorch` models on a shared GPU and everything works smoothly, but in some cases, when one pod starts using a...

Narsil

vGPU Telemetry

1

Is there a way to log vGPU utilization metrics and monitor them with aws-virtual-gpu-device-plugin? I currently use the NVML library with Datadog, but it is not aware of the virtual GPUs so...

amybachir

Updated device-plugin.yaml with nvidia-cuda tag

*Issue #, if available:* The manifest fails because the latest tag is deprecated; moving to nvidia/cuda:11.3.0-runtime-ubuntu18.04 *Description of changes:* Move from nvidia/cuda:latest -> nvidia/cuda:11.3.0-runtime-ubuntu18.04 By submitting this pull request, I confirm that...

hemandee
