
Change default gpu from K80 to T4

Open infwinston opened this issue 3 years ago • 5 comments

As discussed in https://github.com/sky-proj/sky/issues/700, T4 seems to be a better choice as the default gpu node. Please comment below if you have any additional thoughts.

Copied from the discussion.

Speaking of the default gpu node type, should we consider changing it to g4dn.xlarge? It can be cheaper ($0.526/hr vs. $0.900/hr for p2.xlarge, though its host memory is smaller), and T4 is a newer generation than K80 (remember that we had to choose images with an older NVIDIA driver in order to support K80). T4 also supports mixed precision and seems to be more cost-efficient than K80 (details here).

A possible issue is that the T4:1 requirement is satisfied by many VM instance types on each cloud, so the candidate list is longer. Not sure whether users would like that.
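
To put the two AWS prices quoted above in perspective, here is a minimal sketch of the relative saving. The constant and function names are made up for illustration; only the two hourly prices come from the discussion.

```python
# Hourly on-demand prices (USD) quoted in the discussion above.
P2_XLARGE_K80 = 0.900   # current default (K80)
G4DN_XLARGE_T4 = 0.526  # proposed default (T4)


def relative_saving(old_price: float, new_price: float) -> float:
    """Return the fraction saved by switching from old_price to new_price."""
    return (old_price - new_price) / old_price


saving = relative_saving(P2_XLARGE_K80, G4DN_XLARGE_T4)
print(f"g4dn.xlarge is {saving:.1%} cheaper per hour than p2.xlarge")
```

Note that this only compares the two AWS instance types. In the logs below, the overall cheapest K80 option is the Azure promo instance at $0.22/hr, so the estimated cost of the default gpunode goes up after the change.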

Before:

(sky) weichiang@blaze:~/repos/sky$ sky gpunode
I 04-05 00:01:32 optimizer.py:601] Optimizer - plan minimizing cost
I 04-05 00:01:32 optimizer.py:613] Estimated cost: ~$0.2/hr
I 04-05 00:01:32 optimizer.py:628] 
I 04-05 00:01:32 optimizer.py:628] TASK     BEST_RESOURCE
I 04-05 00:01:32 optimizer.py:628] gpunode  Azure(Standard_NC6_Promo, {'K80': 1})
I 04-05 00:01:32 optimizer.py:628] 
I 04-05 00:01:32 optimizer.py:648] Considered resources -> cost ($)
I 04-05 00:01:32 optimizer.py:649] {Azure(Standard_NC6_Promo, {'K80': 1}): 0.22,
I 04-05 00:01:32 optimizer.py:649]  AWS(p2.xlarge, {'K80': 1}): 0.9,
I 04-05 00:01:32 optimizer.py:649]  GCP(n1-highmem-8, {'K80': 1}): 0.92}
I 04-05 00:01:32 optimizer.py:649] 
I 04-05 00:01:32 optimizer.py:665] Multiple Azure instances satisfy K80:1. The cheapest Azure(Standard_NC6_Promo, {'K80': 1}) is considered among:
I 04-05 00:01:32 optimizer.py:665] ['Standard_NC6_Promo', 'Standard_NC6'].
I 04-05 00:01:32 optimizer.py:665] 
I 04-05 00:01:32 optimizer.py:671] To list more details, run 'sky show-gpus K80'.

After:

(sky) weichiang@blaze:~/repos/sky$ sky gpunode
I 04-05 00:00:48 optimizer.py:601] Optimizer - plan minimizing cost
I 04-05 00:00:48 optimizer.py:613] Estimated cost: ~$0.5/hr
I 04-05 00:00:48 optimizer.py:628] 
I 04-05 00:00:48 optimizer.py:628] TASK     BEST_RESOURCE
I 04-05 00:00:48 optimizer.py:628] gpunode  AWS(g4dn.xlarge, {'T4': 1})
I 04-05 00:00:48 optimizer.py:628] 
I 04-05 00:00:48 optimizer.py:648] Considered resources -> cost ($)
I 04-05 00:00:48 optimizer.py:649] {GCP(n1-highmem-8, {'T4': 1}): 0.82,
I 04-05 00:00:48 optimizer.py:649]  Azure(Standard_NC4as_T4_v3, {'T4': 1}): 0.63,
I 04-05 00:00:48 optimizer.py:649]  AWS(g4dn.xlarge, {'T4': 1}): 0.53}
I 04-05 00:00:48 optimizer.py:649] 
I 04-05 00:00:48 optimizer.py:665] Multiple AWS instances satisfy T4:1. The cheapest AWS(g4dn.xlarge, {'T4': 1}) is considered among:
I 04-05 00:00:48 optimizer.py:665] ['g4dn.xlarge', 'g4dn.2xlarge', 'g4dn.4xlarge', 'g4dn.8xlarge', 'g4dn.16xlarge'].
I 04-05 00:00:48 optimizer.py:665] 
I 04-05 00:00:48 optimizer.py:665] Multiple Azure instances satisfy T4:1. The cheapest Azure(Standard_NC4as_T4_v3, {'T4': 1}) is considered among:
I 04-05 00:00:48 optimizer.py:665] ['Standard_NC4as_T4_v3', 'Standard_NC8as_T4_v3', 'Standard_NC16as_T4_v3'].
I 04-05 00:00:48 optimizer.py:665] 
I 04-05 00:00:48 optimizer.py:671] To list more details, run 'sky show-gpus T4'.

infwinston avatar Apr 05 '22 07:04 infwinston

Should we update "By default, use 1 K80..." here https://sky-proj-sky.readthedocs-hosted.com/en/latest/reference/interactive-nodes.html#interactive-nodes ?

On Tue, Apr 5, 2022 at 00:24 Zhanghao Wu @.***> wrote:

@.**** approved this pull request.

LGTM.

concretevitamin avatar Apr 05 '22 15:04 concretevitamin

A minor wording thing:

The cheapest AWS(g4dn.xlarge, {'T4': 1}) is considered among:

Should we change "considered" to "picked"?

concretevitamin avatar Apr 05 '22 15:04 concretevitamin

Quick question before we merge this - does everyone typically have quotas for g instance types? Here's what my quotas look like: [image]

I've always used p instances for my workloads (even before I used sky), and now I have to additionally request quotas for g instances too to get gpunode to work out of the box.

If this change means we have to get people to request two quotas (g and p) instead of one (only p, which many folks usually have), then maybe we should give this a little more thought since it may increase friction, esp for new users...

romilbhardwaj avatar Apr 05 '22 17:04 romilbhardwaj

Should we update "By default, use 1 K80..." here https://sky-proj-sky.readthedocs-hosted.com/en/latest/reference/interactive-nodes.html#interactive-nodes ?

Fixed. Thanks for catching this!

A minor wording thing:

The cheapest AWS(g4dn.xlarge, {'T4': 1}) is considered among:

Should we change "considered" to "picked"?

The reason I used "considered" is that the cheapest instance from each cloud may not be the final choice, so "picked" can be ambiguous. It's more like "picked for further comparison". For example, the message below would also be shown for Azure:

... Azure(Standard_NC4as_T4_v3, {'T4': 1}) is picked among ...

Anyway, "picked" is shorter, so we can go with that.
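
The two-step selection described above (cheapest instance per cloud first, then a comparison across clouds) can be sketched roughly as follows. This is not the actual optimizer code; the function names and the data layout are assumptions for illustration. The prices are the ones shown in the "After" log, which only lists a price for the cheapest instance on each cloud.

```python
from typing import Dict, Tuple

# cloud -> {instance type -> hourly price in USD}.
# Prices copied from the "After" log; the other listed instance types
# (e.g. g4dn.2xlarge) are omitted because the log gives no price for them.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "AWS": {"g4dn.xlarge": 0.53},
    "Azure": {"Standard_NC4as_T4_v3": 0.63},
    "GCP": {"n1-highmem-8": 0.82},
}


def cheapest_per_cloud(
    candidates: Dict[str, Dict[str, float]],
) -> Dict[str, Tuple[str, float]]:
    """Step 1: keep only the cheapest (instance, price) within each cloud."""
    return {
        cloud: min(instances.items(), key=lambda item: item[1])
        for cloud, instances in candidates.items()
        if instances  # skip clouds with no matching instance
    }


def best_overall(per_cloud: Dict[str, Tuple[str, float]]) -> Tuple[str, str, float]:
    """Step 2: compare the per-cloud winners and return the cheapest one."""
    cloud, (instance, price) = min(per_cloud.items(), key=lambda item: item[1][1])
    return cloud, instance, price


per_cloud = cheapest_per_cloud(CANDIDATES)
print(per_cloud)                # per-cloud winners ("picked for further comparison")
print(best_overall(per_cloud))  # final choice across clouds
```

The per-cloud winners are what the "is considered among" message reports; only the result of the second step is the final choice.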

infwinston avatar Apr 05 '22 17:04 infwinston

I've always used p instances for my workloads (even before I used sky), and now I have to additionally request quotas for g instances too to get gpunode to work out of the box.

If this change means we have to get people to request two quotas (g and p) instead of one (only p, which many folks usually have), then maybe we should give this a little more thought since it may increase friction, esp for new users...

@romilbhardwaj good point! Yeah, it could be a problem that users need to request G and P quotas separately, for T4 and V100 respectively. But AWS generally approves G instance quotas more easily than P (they usually direct users to G instances when a P quota request is rejected). (My G quota is 256 vs. 160 for P.)

Not sure how much hassle it would cause, given that users need to request quotas anyway (by default both are zero, I think). The ultimate solution would be to check what your quotas look like and choose a corresponding GPU for you.
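
A rough sketch of that quota-based idea, assuming the quotas are already available as a plain mapping from AWS instance family to quota. How the quotas would actually be fetched is left out, and `pick_default_gpu` and `DEFAULT_GPU_PREFERENCE` are hypothetical names, not existing sky functions.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical preference order: (GPU name, AWS instance family it needs quota for).
# T4 runs on G instances (g4dn); K80 runs on P instances (p2).
DEFAULT_GPU_PREFERENCE: Sequence[Tuple[str, str]] = (("T4", "G"), ("K80", "P"))


def pick_default_gpu(
    quotas: Mapping[str, int],
    preference: Sequence[Tuple[str, str]] = DEFAULT_GPU_PREFERENCE,
) -> Optional[str]:
    """Return the first preferred GPU whose instance family has a non-zero quota."""
    for gpu, family in preference:
        if quotas.get(family, 0) > 0:
            return gpu
    return None  # no usable quota; the user has to request one first


print(pick_default_gpu({"G": 256, "P": 160}))  # T4
print(pick_default_gpu({"G": 0, "P": 160}))    # K80
print(pick_default_gpu({}))                    # None
```

With this approach a user who only has a P quota would still get a working gpunode out of the box, while users with a G quota get the cheaper T4 default.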

infwinston avatar Apr 05 '22 17:04 infwinston