
DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100)

Open haardm opened this issue 6 months ago • 1 comments

Hi team,

We are observing PCIe policy violations on P5 instances that occur consistently from the time the instance is launched. We currently respond by terminating and replacing the instance, but that is an expensive operation, both in time and because the instance type is an EC2 P5. Also, the threshold is set by default and cannot be configured by the client when subscribing to the policy.
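
Since the threshold cannot be tuned at subscription time, one possible client-side mitigation is to debounce the notifications before triggering a replacement. The sketch below is only an illustration under that assumption: `ViolationWindow`, `max_events`, and `window_seconds` are hypothetical names that are not part of DCGM, and the numbers are placeholders rather than recommended values.

```python
from collections import deque
from typing import Deque


class ViolationWindow:
    """Counts violation notifications inside a sliding time window.

    Hypothetical helper, not a DCGM API: it only decides whether the
    notifications seen so far are frequent enough to act on.
    """

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one notification; return True if the threshold is exceeded."""
        self._timestamps.append(timestamp)
        # Drop notifications that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_events


# Placeholder policy: act only on more than 3 notifications within 60 seconds.
window = ViolationWindow(max_events=3, window_seconds=60.0)
decisions = [window.record(t) for t in (0.0, 10.0, 20.0, 30.0, 200.0)]
```

With these placeholder values, only the fourth notification would trigger a replacement; the isolated notification at 200 seconds would not. This filters what the client acts on and does not change what DCGM itself reports.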

A few asks:

- Is there an upstream fix planned from NVIDIA?
- Are there any repercussions to temporarily not subscribing to this policy?
- What would go wrong if we let the PCIe errors keep happening silently?

haardm, Aug 14 '24 18:08