opencost
opencost copied to clipboard
Fix gpuCost computation error
What does this PR change?
Solve gpuCost computation error
Does this PR address any GitHub or Zendesk issues?
Closes https://github.com/opencost/opencost/issues/2657
Does this PR require changes to documentation?
No
The latest updates on your projects. Learn more about Vercel for Git ↗︎
Name | Status | Preview | Comments | Updated (UTC) |
---|---|---|---|---|
opencost | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Mar 25, 2024 10:11am |
@ElieLiabeuf thanks for the contribution. Could you sign your commit and fix the DCO warning? This looks relatively straightforward
@ElieLiabeuf thanks for the contribution. Could you sign your commit and fix the DCO warning? This looks relatively straightforward
It's fixed, thanks.
@AjayTripathy you asked him to contribute this patch in https://github.com/opencost/opencost/issues/2657, could you review?
@ElieLiabeuf this generally LGTM but I'm struggling to test this on my end without shared GPUs. Could you share how you tested and verified this fix?
In particular, could you share the results of /costDataModel after the fix along with node labels? I think that should be sufficient to get this merged.
This pull request has been marked as stale because it has been open for 90 days with no activity. Please remove the stale label or comment or this pull request will be closed in 5 days.
This pull request was closed because it has been inactive for 95 days with no activity.
@thomasvn could you possibly pick up testing this as part of your GPU sprint?
Sure! @ElieLiabeuf I'll take a look at this soon.
Let's please get this in ASAP.
@ElieLiabeuf Could we try to merge this change into the develop
branch instead of directly into the v1.109
branch? A new PR may need to be created.
@thomasvn if this is meaningful I may just ask you to take over the fix?
@AjayTripathy Yes I can do that soon!
Following up to see if we can get this in this week and start testing.
After @thomasvn and I spent a few minutes looking at the code, it seems apparent that this logic will be reached if the node label nvidia.com/gpu.replicas
is found in addition to there being any allocatable capacity for nvidia.com/gpu
(the standard GPU resource). However, because this label is provided by GFD irrespective of whether any sharing strategy is in use (exclude one specific MIG configuration), getting this fix in is important since the calculations will be off any time GFD is deployed in the cluster. In cases where GFD and, therefore, this label are absent, the logic being patched here appears to be unaffected.
In that case what do we need to merge @thomasvn ?
We can close this PR now, as the fix will be available in https://github.com/opencost/opencost/pull/2862