Multiple Scheduler Bugs: Deployment Update Resource Allocation and GPU Utilization
Environment
- Kubernetes version: v1.27.9
- HAMi version: v2.3.9
Bug 1: Possible Scheduler Bug When Updating Deployment with Insufficient Resources
Encountered a scheduler bug when updating a Deployment's resource requirements beyond the available capacity in a Kubernetes cluster with heterogeneous memory and GPU resources.
Steps to reproduce the issue (a manifest sketch follows the list)
- Pre-conditions:
  - Node 1: 1 GPU with 4GiB of GPU memory
  - Node 2: 1 GPU with 4GiB of GPU memory
  - Node 3: 2 GPUs, each with 16GiB of GPU memory
- Create Deployment A:
  - Replicas: 1
  - GPU memory requirement: 16GiB (per GPU)
  - GPU requirement: 2
- Create Deployment B:
  - Replicas: 1
  - GPU memory requirement: 4GiB
  - GPU requirement: 1
- Delete Deployment A
- Modify Deployment B:
  - Change replicas to 3
  - Change the GPU memory requirement to 8GiB (per GPU)
  - Change the GPU requirement to 2
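For concreteness, here is a minimal manifest sketch of the two Deployments, assuming HAMi's `nvidia.com/gpu` and `nvidia.com/gpumem` extended resources (the latter being per-vGPU device memory in MiB); the Deployment names and container image are placeholders, not taken from the original setup:

```yaml
# Deployment A: 1 replica, 2 GPUs with 16GiB of device memory each
# (hypothetical manifest; resource names per HAMi's documentation)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deployment-a
  template:
    metadata:
      labels:
        app: deployment-a
    spec:
      containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 2          # 2 vGPUs
              nvidia.com/gpumem: 16384   # 16GiB of device memory per vGPU, in MiB
---
# Deployment B: 1 replica, 1 GPU with 4GiB of device memory
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deployment-b
  template:
    metadata:
      labels:
        app: deployment-b
    spec:
      containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
              nvidia.com/gpumem: 4096    # 4GiB in MiB
```

The problematic update then amounts to editing `deployment-b` so that `replicas` becomes 3, `nvidia.com/gpu` becomes 2, and `nvidia.com/gpumem` becomes 8192.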
Expected Behavior
The update should fail because the cluster does not have enough GPU memory and GPUs to satisfy 3 replicas of Deployment B at the new requirements. The expected allocation is:
- Node 1: 4GiB of GPU memory still occupied by the pre-existing replica of Deployment B
- Node 2: Unchanged (idle)
- Node 3: 2 new replicas of Deployment B fully occupy the memory of both GPUs (each replica allocates 8GiB on each GPU, so each GPU ends up with 8GiB + 8GiB = 16GiB)
Actual Behavior
The update fails as expected, but the node resource allocation is reported incorrectly:
- Node 1: 4GiB of GPU memory occupied (as expected)
- Node 2: Unchanged (idle)
- Node 3: The two GPUs are reported as 8GiB and 12GiB allocated, inconsistent with the expected 16GiB on each GPU
Prometheus Metrics
Bug 2: Incorrect GPU Utilization
Encountered a scheduler bug when creating a Deployment whose GPU core (utilization) requirement exceeds the 100% per-GPU maximum in a Kubernetes cluster with heterogeneous memory and GPU resources.
Steps to reproduce the issue (a manifest sketch follows the list)
- Pre-conditions:
  - Node 1: 1 GPU with 4GiB of GPU memory (max utilization: 100%)
  - Node 2: 1 GPU with 4GiB of GPU memory (max utilization: 100%)
  - Node 3: 2 GPUs, each with 16GiB of GPU memory (max utilization: 100% each)
- Create Deployment A:
  - Replicas: 1
  - GPU memory requirement: 4GiB
  - GPU requirement: 1
  - GPUcores requirement: 120 (more than the 100% maximum, taking 100 as full utilization of one GPU)
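A sketch of this Deployment, again assuming HAMi's extended resource names (`nvidia.com/gpucores` is a percentage where 100 means one full GPU); the name and image are placeholders:

```yaml
# Deployment A: requests 120 GPU cores, i.e. more than the 100% per-GPU
# maximum, so the scheduler should reject it (hypothetical manifest)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deployment-a
  template:
    metadata:
      labels:
        app: deployment-a
    spec:
      containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
              nvidia.com/gpumem: 4096     # 4GiB in MiB
              nvidia.com/gpucores: 120    # exceeds the 100% utilization cap
```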
Expected Behavior
The Deployment should fail to schedule because its GPU utilization requirement exceeds the 100% maximum. The expected allocation is:
- Node 1: GPU memory should remain unallocated (4GiB free)
- Node 2: GPU memory should remain unallocated (4GiB free)
- Node 3: Both GPUs should remain unallocated (16GiB of memory and 100% utilization free on each)
Actual Behavior
The Deployment is incorrectly scheduled, with the following resource allocation:
- Node 1: Unchanged (4GiB of GPU memory idle)
- Node 2: Unchanged (4GiB of GPU memory idle)
- Node 3: Resources are reported incorrectly:
  - First GPU: Reported as 4GiB of memory and 100% utilization allocated to Deployment A (there should be no allocation at all)
  - Second GPU: Unallocated (16GiB of memory and 100% utilization idle)