[Roadmap] Q3 2024
This issue outlines the major items planned for Q3 2024.
For major bugs, see https://github.com/dstackai/dstack/labels/major
## Core features
- [ ] `dstack apply`
  - #1571
  - #1472
  - #1671
  - #1223
  - #1326
  - #1327
- [x] Production-grade deployment of the server
  - #1577
  - #1632
  - #1393
- [ ] Gateways
  - #1664
  - #1595
  - #1349
  - #1631
- [ ] Volumes
  - #1158
  - #1296
  - #1469
  - #1467
- [ ] Private subnets
  - #1201
  - #1171
  - #1264
  - #1304
  - #1321
  - #1672
- [ ] Clusters
  - #1337
  - #1327
  - #1489
- [ ] #708
- [ ] Reserved instances
  - #1155
- [x] #1419
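For context on the `dstack apply` item above: `dstack apply` applies a configuration file to the server. A minimal task configuration might look roughly like the sketch below (the `name`, `commands`, and `resources` values are illustrative, not taken from any of the linked issues):

```yaml
type: task
# Illustrative values; see the dstack documentation for the full schema
name: train-example
python: "3.11"
commands:
  - pip install -r requirements.txt
  - python train.py
resources:
  gpu: 24GB
```

This would typically be applied with `dstack apply -f <config>.dstack.yml`, which provisions an instance matching `resources` and runs `commands` on it.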
## Supported architectures
- [ ] TPU
  - #956
  - #1337
- [ ] AMD
  - #1413
- [ ] Gaudi
## Examples

> [!IMPORTANT]
> Community help is welcome!

- [ ] AMD
  - #1598
- [ ] Axolotl
- [ ] TRL
- [ ] Nim
- [ ] GitHub Actions
- [ ] VLLM/TGI
- [ ] Llama 3.2 (multi-modal)
- [ ] FLUX
- [x] Ray
- [x] Spark
- [ ] Unsloth
- [ ] Alignment Handbook with Llama 3.1
- [ ] Nemo
- [ ] TensorRT-LLM
- [ ] Triton
- [ ] Function calling
- [ ] Llama Index
- [ ] LangChain
- [x] TPU
- [x] Multi-node Alignment Handbook
- [x] Llama 3.1
- [x] Fine-tuning Llama 3.1
- [x] Axolotl
## Improvements
- [ ] Support for H100 in AWS/GCP/Azure #1238 #1240 #1239
- [ ] Support for H200 in AWS/GCP/Azure once they support it
- [x] Support for L4 in AWS #1235
- [x] Troubleshooting guide #1673
- [ ] Make it easier to add custom backends (implementation and documentation)
## Research

> [!IMPORTANT]
> Research and feedback are required!

- [ ] Metrics: Research whether `dstack` should collect certain metrics out of the box (at least hardware utilization) or whether it should integrate with more enterprise-grade tools.
- [ ] Fault-tolerant training: Research how `dstack` can be used for fault-tolerant training of massive models.
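As a starting point for the metrics research item above, an out-of-the-box hardware-utilization sampler could be quite small. The sketch below is a hypothetical illustration, not `dstack`'s actual implementation: it uses only the standard library, reading CPU load from the OS and GPU stats from `nvidia-smi` when that binary is present.

```python
import os
import shutil
import subprocess
import time


def sample_utilization():
    """Collect a minimal hardware-utilization snapshot (illustrative sketch).

    CPU figures come from the standard library; GPU figures are parsed from
    `nvidia-smi` output when the tool is available, and omitted otherwise.
    """
    snapshot = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_avg_1m": os.getloadavg()[0],  # 1-minute load average (POSIX)
    }
    if shutil.which("nvidia-smi"):
        # One CSV line per GPU, e.g. "37, 11264"
        out = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=utilization.gpu,memory.used",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True, text=True, check=True,
        ).stdout
        snapshot["gpus"] = [
            dict(zip(("util_pct", "mem_used_mib"), map(int, line.split(", "))))
            for line in out.strip().splitlines()
        ]
    return snapshot
```

Whether such polling belongs in the server itself or should be delegated to tools like Prometheus is exactly the open question this item raises.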