
Multiple Scheduler Bugs: Deployment Update Resource Allocation and GPU Utilization

Open · michael-nammi opened this issue on May 10, 2024 · 3 comments

Environment

Kubernetes version: v1.27.9
HAMi version: v2.3.9

Bug 1: Possible Scheduler Bug When Updating Deployment with Insufficient Resources

Encountered a scheduler bug when updating a Deployment's resource requirements beyond the available capacity in a Kubernetes cluster with heterogeneous memory and GPU resources.

Steps to reproduce the issue

  1. Pre-conditions:
    • Node 1: 4GiB Memory, 1 GPU
    • Node 2: 4GiB Memory, 1 GPU
    • Node 3: 16GiB Memory, 2 GPUs (each GPU with 16GiB)
  2. Create Deployment A (example manifests for both Deployments are sketched after this list):
    • Replicas: 1
    • Memory requirement: 16GiB
    • GPU requirement: 2
  3. Create Deployment B:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
  4. Delete Deployment A
  5. Modify Deployment B
    • Change replicas to 3
    • Change memory requirement to 8GiB
    • Change GPU requirement to 2
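
For reference, a minimal sketch of manifests that reproduce steps 2 through 5, assuming HAMi's default extended resource names (nvidia.com/gpu for GPU count and nvidia.com/gpumem for per-GPU memory in MiB); the Deployment names and container image are placeholders:

```yaml
# Step 2: Deployment A, 1 replica, 2 GPUs, 16GiB of GPU memory per GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a                     # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels: {app: deployment-a}
  template:
    metadata:
      labels: {app: deployment-a}
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 2            # 2 GPUs
            nvidia.com/gpumem: 16384     # 16GiB per GPU, in MiB
---
# Step 3: Deployment B, 1 replica, 1 GPU, 4GiB of GPU memory
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b                     # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels: {app: deployment-b}
  template:
    metadata:
      labels: {app: deployment-b}
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1            # 1 GPU
            nvidia.com/gpumem: 4096      # 4GiB, in MiB
```

Step 5 then re-applies deployment-b with replicas: 3, nvidia.com/gpu: 2, and nvidia.com/gpumem: 8192 (8GiB per GPU) after deployment-a has been deleted in step 4.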

Expected Behavior

The update should fail, because there is not enough memory and GPUs available in the cluster to satisfy the requirements of 3 replicas of Deployment B with the new resources: only Node 3 has 2 GPUs, and at 8GiB per GPU per replica its 16GiB GPUs can hold at most 2 such replicas.

  • Node 1: 4GiB Memory occupied by the pre-existing resources of Deployment B
  • Node 2: Unchanged (idle)
  • Node 3: 2 replicas of Deployment B fully occupy the memory of both GPUs (2 × 8GiB = 16GiB per GPU)

Actual Behavior

The update fails, but the node resource allocation is incorrectly reported:

  • Node 1: 4GiB Memory
  • Node 2: Unchanged (idle)
  • Node 3: The two GPUs are reported as 8GiB and 12GiB allocated, which is inconsistent with the expected result of both GPUs being fully allocated (16GiB each)

Prometheus Metrics

[screenshot: Prometheus metrics]

Bug 2: Incorrect GPU Utilization

Encountered a scheduler bug when creating a Deployment whose GPU core (utilization) requirement exceeds 100% of a single GPU, in a Kubernetes cluster with heterogeneous memory and GPU resources.

Steps to reproduce the issue

  1. Pre-conditions:
    • Node 1: 4GiB Memory, 1 GPU (Max Utilization: 100%)
    • Node 2: 4GiB Memory, 1 GPU (Max Utilization: 100%)
    • Node 3: 16GiB Memory, 2 GPUs (each GPU with 16GiB Memory and Max Utilization: 100%)
  2. Create Deployment A (see the manifest sketch after this list):
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
    • GPU cores requirement: 120 (i.e., more than 100% GPU utilization, taking 100 as the maximum for a single GPU)
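
A minimal sketch of the over-limit Deployment for this bug, again assuming HAMi's default extended resource names (with nvidia.com/gpucores read as a percentage of a single GPU's compute); the Deployment name and container image are placeholders:

```yaml
# Deployment A for Bug 2: requests 120% of one GPU's cores, which should be unschedulable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a-gpucores             # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels: {app: deployment-a-gpucores}
  template:
    metadata:
      labels: {app: deployment-a-gpucores}
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1              # 1 GPU
            nvidia.com/gpumem: 4096        # 4GiB of GPU memory, in MiB
            nvidia.com/gpucores: 120       # 120% of one GPU, exceeds the 100% maximum
```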

Expected Behavior

The deployment should fail to be scheduled due to the GPU utilization requirement exceeding the maximum limit of 100%.

  • Node 1: Memory should remain unallocated (4GiB)
  • Node 2: Memory should remain unallocated (4GiB)
  • Node 3: Both GPUs should remain unallocated (16GiB + 100%, and 16GiB + 100%)

Actual Behavior

The deployment is incorrectly scheduled with the following resource allocation:

  • Node 1: Unchanged (4GiB Memory idle)
  • Node 2: Unchanged (4GiB Memory idle)
  • Node 3: Resources are reported incorrectly:
    • First GPU: Appears as if 4GiB Memory + 100% Utilization has been allocated to Deployment A (should be no allocation)
    • Second GPU: Unallocated (16GiB Memory and 100% Utilization idle)

michael-nammi · May 10 '24 08:05