
[Bug]: Service re-run terminates despite available fleet capacity.

Open Bihan opened this issue 1 week ago • 4 comments

Steps to reproduce

Configs:

# my_cpu_fleet.yml
type: fleet
name: cpu-default

nodes: 0..8

resources:
  cpu: 2

# simple-service-replicas.yml
type: service
name: simple-service-replicas
https: false
python: 3.12


commands:
  - echo "Group default - Version 1" > /tmp/version.txt
  - python3 -m http.server 8000

port: 8000

resources:
  cpu: 2

replicas: 5

Step 1: Create the fleet: dstack apply -f my_cpu_fleet.yml

Step 2: Apply the service config: dstack apply -f simple-service-replicas.yml

The first run works as expected:

dstack ps
 NAME                     BACKEND          GPU  PRICE    STATUS   SUBMITTED  
 simple-service-replicas                   -    -        running  5 mins ago 
   replica=0              aws (us-east-2)  -    $0.0832  running  5 mins ago 
   replica=1              aws (us-east-2)  -    $0.0832  running  5 mins ago 
   replica=2              aws (us-east-2)  -    $0.0832  running  5 mins ago 
   replica=3              aws (us-east-2)  -    $0.0832  running  5 mins ago 
   replica=4              aws (us-east-2)  -    $0.0832  running  5 mins ago 
dstack fleet
 FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED    
 default      -         -                -                         -        -       4 days ago 
 cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 
              1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 
              2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 
              3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 
              4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago 

Step 3: Stop the run. The fleet instances become idle, as expected:

FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED    
default      -         -                -                         -        -       4 days ago 
cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    6 mins ago 
             1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago 
             2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago 
             3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago 
             4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago 

Step 4: Apply the service again: dstack apply -f simple-service-replicas.yml

dstack ps
 NAME                     BACKEND          GPU  PRICE    STATUS   SUBMITTED  
 simple-service-replicas                   -    -        running  56 sec ago 
   replica=0              aws (us-east-2)  -    $0.0832  running  55 sec ago 
   replica=1              aws (us-east-2)  -    $0.0832  pulling  55 sec ago 
   replica=2              aws (us-east-2)  -    $0.0832  pulling  55 sec ago 
   replica=3              aws (us-east-2)  -    $0.0832  pulling  55 sec ago 
   replica=4              aws (us-east-2)  -    $0.0832  running  55 sec ago 

All fleet instances are expected to be busy while the replicas are pulling or running, but some remain idle, as shown below:

dstack fleet
 FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED     
 default      -         -                -                         -        -       4 days ago  
 cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago 
              1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago 
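
The mismatch in Step 4 can be checked mechanically. The following is a minimal sketch, not part of dstack: it assumes the plain-text layout of `dstack ps` and `dstack fleet` shown in this issue, where the STATUS column directly follows a `$` price, and it assumes that `pulling` and `running` are the replica states that should occupy an instance. The helper names are hypothetical.

```python
# Sample output copied from Step 4 of this issue.
PS_OUTPUT = """\
 NAME                     BACKEND          GPU  PRICE    STATUS   SUBMITTED
 simple-service-replicas                   -    -        running  56 sec ago
   replica=0              aws (us-east-2)  -    $0.0832  running  55 sec ago
   replica=1              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
   replica=2              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
   replica=3              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
   replica=4              aws (us-east-2)  -    $0.0832  running  55 sec ago
"""

FLEET_OUTPUT = """\
 FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
 default      -         -                -                         -        -       4 days ago
 cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
              1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
              2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
              3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
              4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
"""

# Assumption: replica states that should hold a fleet instance.
ACTIVE_REPLICA_STATES = {"pulling", "running"}


def statuses_after_price(table: str) -> list[str]:
    """Return the token following the '$' price on every row that has one.

    Header rows and rows with '-' in the PRICE column are skipped,
    because they contain no token starting with '$'.
    """
    statuses = []
    for line in table.splitlines():
        tokens = line.split()
        for i, token in enumerate(tokens[:-1]):
            if token.startswith("$"):
                statuses.append(tokens[i + 1])
                break
    return statuses


def active_and_busy(ps_output: str, fleet_output: str) -> tuple[int, int]:
    """Count active replicas and busy fleet instances."""
    replica_states = statuses_after_price(ps_output)
    instance_states = statuses_after_price(fleet_output)
    active = sum(1 for s in replica_states if s in ACTIVE_REPLICA_STATES)
    busy = instance_states.count("busy")
    return active, busy


if __name__ == "__main__":
    active, busy = active_and_busy(PS_OUTPUT, FLEET_OUTPUT)
    print(f"active replicas: {active}, busy instances: {busy}")
```

For the Step 4 snapshot this reports 5 active replicas against 2 busy instances, which is the inconsistency described above.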

Step 5: Check the run after a while:

dstack ps
 NAME                     BACKEND          GPU  PRICE    STATUS       SUBMITTED  
 simple-service-replicas                   -    -        terminating  3 mins ago 
   replica=0              aws (us-east-2)  -    $0.0832  running      3 mins ago 
   replica=1              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=2              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=3              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=4              aws (us-east-2)  -    $0.0832  running      3 mins ago

The run gets terminated.

Actual behaviour

The run gets terminated on re-run even though the fleet has idle instances.

dstack fleet
 FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED     
 default      -         -                -                         -        -       4 days ago  
 cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago 
              1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago 
              4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago 
dstack ps
 NAME                     BACKEND          GPU  PRICE    STATUS       SUBMITTED  
 simple-service-replicas                   -    -        terminating  3 mins ago 
   replica=0              aws (us-east-2)  -    $0.0832  running      3 mins ago 
   replica=1              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=2              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=3              aws (us-east-2)  -    $0.0832  terminating  3 mins ago 
   replica=4              aws (us-east-2)  -    $0.0832  running      3 mins ago

Expected behaviour

The re-run should not be terminated, and the idle fleet instances should be utilized.

dstack version

master (commit: b2be6a7e)

Server logs


Additional information

server_logs_fleet_issue.txt

Bihan, Dec 19 '25 14:12