opencost icon indicating copy to clipboard operation
opencost copied to clipboard

Fix gpuCost computation error

Open ElieLiabeuf opened this issue 11 months ago • 5 comments

What does this PR change?

Solve gpuCost computation error

Does this PR address any GitHub or Zendesk issues?

Closes https://github.com/opencost/opencost/issues/2657

Does this PR require changes to documentation?

No

ElieLiabeuf avatar Mar 22 '24 17:03 ElieLiabeuf

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
opencost ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 25, 2024 10:11am

vercel[bot] avatar Mar 22 '24 17:03 vercel[bot]

@ElieLiabeuf thanks for the contribution. Could you sign your commit and fix the DCO warning? This looks relatively straightforward

mattray avatar Mar 22 '24 19:03 mattray

@ElieLiabeuf thanks for the contribution. Could you sign your commit and fix the DCO warning? This looks relatively straightforward

It's fixed, thanks.

ElieLiabeuf avatar Mar 25 '24 10:03 ElieLiabeuf

@AjayTripathy you asked him to contribute this patch in https://github.com/opencost/opencost/issues/2657, could you review?

mattray avatar Apr 10 '24 07:04 mattray

@ElieLiabeuf this generally LGTM but I'm struggling to test this on my end without shared GPUs. Could you share how you tested and verified this fix?

In particular, could you share the results of /costDataModel after the fix along with node labels? I think that should be sufficient to get this merged.

AjayTripathy avatar Apr 11 '24 22:04 AjayTripathy

This pull request has been marked as stale because it has been open for 90 days with no activity. Please remove the stale label or comment or this pull request will be closed in 5 days.

github-actions[bot] avatar Jul 11 '24 01:07 github-actions[bot]

This pull request was closed because it has been inactive for 95 days with no activity.

github-actions[bot] avatar Jul 16 '24 01:07 github-actions[bot]

@thomasvn could you possibly pick up testing this as part of your GPU sprint?

AjayTripathy avatar Jul 22 '24 20:07 AjayTripathy

Sure! @ElieLiabeuf I'll take a look at this soon.

thomasvn avatar Jul 22 '24 22:07 thomasvn

Let's please get this in ASAP.

chipzoller avatar Jul 28 '24 11:07 chipzoller

@ElieLiabeuf Could we try to merge this change into the develop branch instead of directly into the v1.109 branch? A new PR may need to be created.

thomasvn avatar Jul 29 '24 15:07 thomasvn

@thomasvn if this is meaningful I may just ask you to take over the fix?

AjayTripathy avatar Jul 29 '24 18:07 AjayTripathy

@AjayTripathy Yes I can do that soon!

thomasvn avatar Jul 31 '24 16:07 thomasvn

Following up to see if we can get this in this week and start testing.

chipzoller avatar Aug 05 '24 12:08 chipzoller

After @thomasvn and I spent a few minutes looking at the code, it seems apparent that this logic will be reached if the node label nvidia.com/gpu.replicas is found in addition to there being any allocatable capacity for nvidia.com/gpu (the standard GPU resource). However, because this label is provided by GFD irrespective of whether any sharing strategy is in use (exclude one specific MIG configuration), getting this fix in is important since the calculations will be off any time GFD is deployed in the cluster. In cases where GFD and, therefore, this label are absent, the logic being patched here appears to be unaffected.

chipzoller avatar Aug 05 '24 16:08 chipzoller

In that case what do we need to merge @thomasvn ?

AjayTripathy avatar Aug 21 '24 16:08 AjayTripathy

We can close this PR now, as the fix will be available in https://github.com/opencost/opencost/pull/2862

thomasvn avatar Aug 23 '24 16:08 thomasvn