Fix GB/GiB ambiguity

Open jvstme opened this issue 1 year ago • 5 comments

When displaying instance resources, dstack uses GB as the unit for RAM, VRAM, and disk. However, in many cases the values shown actually represent GiB, not GB. Here are some examples:

  1. g5.xlarge on AWS is actually 16 GiB RAM and 24 GiB VRAM, not 16 GB and 24 GB.
    > dstack run . -b aws --gpu A10G
    [... cut for brevity ...]                               
     #  BACKEND  REGION     INSTANCE   RESOURCES                                   SPOT  PRICE    
     1  aws      us-east-1  g5.xlarge  4xCPU, 16GB, 1xA10G (24GB), 100.0GB (disk)  no    $1.006
    
  2. g6.xlarge on AWS is actually 24 GB VRAM, not 22 GB.
    > dstack run . -b aws --gpu L4
    [... cut for brevity ...]
     #  BACKEND  REGION     INSTANCE   RESOURCES                                 SPOT  PRICE     
     1  aws      us-east-2  g6.xlarge  4xCPU, 16GB, 1xL4 (22GB), 100.0GB (disk)  no    $0.8048
    
  3. VM.GPU.A10.1 on OCI is actually 240 GB RAM and 24 GB VRAM, not 236 GB and 22 GB as shown when it is added with dstack pool add-ssh.
    > dstack pool ps                                           
     Pool name  default-pool
    
     INSTANCE        BACKEND  REGION  RESOURCES                                   SPOT  PRICE  STATUS  CREATED   
     tough-kangaroo  ssh      remote  30xCPU, 236GB, 1xA10 (22GB), 33.8GB (disk)  no    $0.0   idle    1 min ago
    

This ambiguity makes it difficult for users to understand what resources they will actually get and may lead to offers being filtered out while they actually match the users' requirements.

jvstme avatar Jul 24 '24 14:07 jvstme

In the context of RAM/VRAM, GB (base 10) doesn't make sense because memory is always sized in base 2. Most vendors write GB when they mean GiB. This convention predates the GiB unit, e.g. NVIDIA writes that the A10 has 24GB, meaning 24GiB; Linux reports memory in GB.

I think we should continue to use GB everywhere in the context of RAM/VRAM to avoid mismatch with most vendors.

So:

  1. 16 GB and 24 GB is fine.
  2. It seems like AWS returns available VRAM (22GB) instead of total GPU VRAM (24GB)?
  3. The resources reported by the shim are expected to be less than physical RAM/VRAM (some reserved RAM/VRAM may not be reported by /proc/meminfo and nvidia-smi).

For storage, distinguishing GB and GiB is important.

r4victor avatar Jul 25 '24 05:07 r4victor

This issue is quite problematic because I requested an instance with GPU with 24GB, and it created one with 22GB. I try to run requiring 24GB and it can't use the existing instance.

peterschmidt85 avatar Jul 25 '24 10:07 peterschmidt85

> I think we should continue to use GB everywhere in the context of RAM/VRAM to avoid mismatch with most vendors.
>
> For storage, distinguishing GB and GiB is important.

@r4victor, so our current policy is that dstack always means GiB when it says "GB", right? I think we can keep this policy as long as we document it. But then it is important to make sure we always stick to it, e.g. if some provider reports storage sizes in base-10 units, we should convert them to base-2 units.
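
For illustration, the base-10 to base-2 conversion could look roughly like the sketch below. The helper names `gb_to_gib` and `gib_to_gb` are hypothetical and are not existing dstack functions; the sketch only shows the arithmetic.

```python
GB = 10**9    # decimal gigabyte, in bytes
GIB = 2**30   # binary gibibyte, in bytes


def gb_to_gib(size_gb: float) -> float:
    """Convert a size in base-10 gigabytes to base-2 gibibytes."""
    return size_gb * GB / GIB


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in base-2 gibibytes to base-10 gigabytes."""
    return size_gib * GIB / GB


print(round(gb_to_gib(24), 2))   # 22.35
print(round(gb_to_gib(100), 2))  # 93.13
```

The two units differ by about 7% at this scale, so a value that a provider reports as 24 GB (base 10) would be displayed as roughly 22 if it were converted and truncated to whole GiB.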

jvstme avatar Jul 25 '24 12:07 jvstme

Cases 2 and 3 are apparently not related to how dstack handles GB/GiB conversions, but let me still comment on them here.

> It seems like AWS returns available VRAM (22GB) instead of total GPU VRAM (24GB)?

More like AWS misreports the VRAM for L4. I compared AWS A10G and L4 instances and they both have ~22.5 GiB VRAM, as reported by nvidia-smi. Yet AWS docs and API state that A10G is 24 GiB and L4 is 24 GB.

We can either contact AWS or just hardcode 24 GiB for L4.

> The resources reported by the shim are expected to be less than physical RAM/VRAM

Then we could replace the values reported by nvidia-smi with the values from KNOWN_GPUS, as long as they are approximately similar. It would solve the UX issue @peterschmidt85 mentioned:

> I try to run requiring 24GB and it can't use the existing instance.
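
A minimal sketch of that idea, under stated assumptions: `NOMINAL_VRAM_GIB` is a hypothetical stand-in for a known-GPU table (it does not reflect the real structure of KNOWN_GPUS), `normalize_vram_gib` is a made-up name, and the 10% tolerance is an arbitrary choice for "approximately similar".

```python
# Hypothetical lookup table; the real KNOWN_GPUS data may be shaped differently.
NOMINAL_VRAM_GIB = {"A10": 24, "A10G": 24, "L4": 24}


def normalize_vram_gib(gpu_name: str, reported_gib: float, tolerance: float = 0.1) -> float:
    """Return the nominal VRAM size if the reported one is close to it.

    Falls back to the reported value for unknown GPUs or when the
    difference exceeds `tolerance` (a fraction of the nominal size).
    """
    nominal = NOMINAL_VRAM_GIB.get(gpu_name)
    if nominal is None:
        return reported_gib
    if abs(nominal - reported_gib) <= tolerance * nominal:
        return float(nominal)
    return reported_gib


print(normalize_vram_gib("L4", 22.5))    # 24.0
print(normalize_vram_gib("L4", 12.0))    # 12.0
print(normalize_vram_gib("H100", 79.6))  # 79.6
```

With this kind of normalization, an instance whose GPU reports ~22.5 GiB would be shown as 24GB and would match a run that requires 24GB.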

jvstme avatar Jul 25 '24 12:07 jvstme

Cases 2 and 3 were moved to https://github.com/dstackai/gpuhunt/issues/91 and #1523 respectively.

This issue will remain open to document that dstack uses base-2 units for everything and double-check that it is consistent with cloud providers.

jvstme avatar Aug 17 '24 20:08 jvstme

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Feb 12 '25 01:02 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

github-actions[bot] avatar Feb 26 '25 01:02 github-actions[bot]