Bihan Rana
**Summary**

After setting the compute partition to `CPX` and the memory partition to `NPS4`, only 8 GPUs **(indices 0, 8, 16, 24, 32, 40, 48, 56)** show a valid `COMPUTE_PARTITION: CPX` and `MEMORY_PARTITION:`...
**Steps To Test**

Step 1: Create `replica-groups-service.yml`

```
# replica-groups-service.yml
type: service
name: replica-groups-test

python: 3.12

replica_groups:
  - name: replica-1
    replicas: 0..2
    scaling:
      metric: rps
      target: 2
    commands:
      - echo "Group...
```
### Steps to reproduce

Configs:

```
# my_cpu_fleet.yml
type: fleet
name: cpu-default

nodes: 0..8

resources:
  cpu: 2
```

```
# simple-service-replicas.yml
type: service
name: simple-service-replicas

https: false
python: 3.12

commands:...
```
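Assuming the standard dstack CLI workflow, the fleet config is applied first and the service second, e.g. `dstack apply -f my_cpu_fleet.yml` followed by `dstack apply -f simple-service-replicas.yml`, after which replica behavior can be observed with `dstack ps` (invocation order assumed; adjust to the actual repro steps).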
This issue tracks the roadmap for implementing native inference capabilities inside dstack. Current LLM inference systems (SGLang, Dynamo, Grove, LLM-d, AIBrix, SGLang OME) revolve around inference-native concepts: TTFT/ITL autoscaling, PD (prefill/decode) disaggregation,...
### Problem

Time To First Token (TTFT) and Inter-Token Latency (ITL) directly reflect user experience:

- **TTFT:** time until the first token appears (responsiveness)
- **ITL:** time between subsequent tokens (generation speed)...
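For context, dstack services currently autoscale on request rate (`metric: rps`, as in the replica-groups config above). Below is a minimal sketch of what a latency-based policy could look like under this proposal; note that `ttft` is not an existing dstack scaling metric, and both the metric name and the target value here are hypothetical.

```
# Hypothetical sketch only: `ttft` is NOT an existing dstack scaling metric.
# Illustrates what latency-based autoscaling could look like under this proposal.
scaling:
  metric: ttft   # hypothetical metric name (time to first token)
  target: 0.5    # hypothetical target, e.g. 500 ms to first token
```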
### Problem

Currently, the `env:` configuration does not support variable interpolation. This means that when we define environment variables like:

```
env:
  - NUM_SHARD=$DSTACK_GPUS_NUM
```

the value is not evaluated...
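One possible workaround, assuming `DSTACK_GPUS_NUM` is exposed as an environment variable inside the run's shell (as the example above implies), is to perform the interpolation in `commands`, where shell expansion does happen. A minimal sketch:

```
# Sketch of a possible workaround (not a confirmed dstack feature):
# let the shell expand the system variable inside `commands` instead of `env:`.
commands:
  - export NUM_SHARD=$DSTACK_GPUS_NUM   # shell expansion happens here
  - echo "NUM_SHARD is $NUM_SHARD"
```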