[Roadmap] Q3 2024
This issue outlines the major items planned for Q3 2024.
For major bugs, see https://github.com/dstackai/dstack/labels/major
## Core features
- [ ] `dstack apply`
  - #1571
  - #1472
  - #1671
  - #1223
  - #1326
  - #1327
- [x] Production-grade deployment of the server
  - #1577
  - #1632
  - #1393
- [ ] Gateways
  - #1664
  - #1595
  - #1349
  - #1631
- [ ] Volumes
  - #1158
  - #1296
  - #1469
  - #1467
- [ ] Private subnets
  - #1201
  - #1171
  - #1264
  - #1304
  - #1321
  - #1672
- [ ] Clusters
  - #1337
  - #1327
  - #1489
- [ ] #708
- [ ] Reserved instances
  - #1155
- [x] #1419
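For context on the `dstack apply` item above: `dstack apply` applies a configuration file to the server. A minimal task configuration might look roughly like the sketch below (the `name`, `commands`, and `resources` values are illustrative, not taken from any of the linked issues):

```yaml
type: task
# Illustrative values; see the dstack documentation for the full schema
name: train-example
python: "3.11"
commands:
  - pip install -r requirements.txt
  - python train.py
resources:
  gpu: 24GB
```

This would typically be applied with `dstack apply -f <config>.dstack.yml`, which provisions an instance matching `resources` and runs `commands` on it.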
## Supported architectures
- [ ] TPU
  - #956
  - #1337
- [ ] AMD
  - #1413
- [ ] Gaudi
## Examples

> [!IMPORTANT]
> Community help is welcome!

- [ ] AMD
  - #1598
- [ ] Axolotl
- [ ] TRL
- [ ] Nim
- [ ] GitHub Actions
- [ ] VLLM/TGI
- [ ] Llama 3.2 (multi-modal)
- [ ] FLUX
- [x] Ray
- [x] Spark
- [ ] Unsloth
- [ ] Alignment Handbook with Llama 3.1
- [ ] Nemo
- [ ] TensorRT-LLM
- [ ] Triton
- [ ] Function calling
- [ ] Llama Index
- [ ] LangChain
- [x] TPU
- [x] Multi-node Alignment Handbook
- [x] Llama 3.1
- [x] Fine-tuning Llama 3.1
- [x] Axolotl
## Improvements
- [ ] Support for H100 in AWS/GCP/Azure #1238 #1240 #1239
- [ ] Support for H200 in AWS/GCP/Azure once they support it
- [x] Support for L4 in AWS #1235
- [x] Troubleshooting guide #1673
- [ ] Make it easier to add custom backends (implementation and documentation)
## Research

> [!IMPORTANT]
> Research and feedback are required!

- [ ] Metrics: Research whether `dstack` should collect certain metrics out of the box (at least hardware utilization) or whether it should integrate with more enterprise-grade tools.
- [ ] Fault-tolerant training: Research how `dstack` can be used for fault-tolerant training of massive models.
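As a starting point for the metrics research item above, an out-of-the-box hardware-utilization sampler could be quite small. The sketch below is a hypothetical illustration, not `dstack`'s actual implementation: it uses only the standard library, reading CPU load from the OS and GPU stats from `nvidia-smi` when that binary is present.

```python
import os
import shutil
import subprocess
import time


def sample_utilization():
    """Collect a minimal hardware-utilization snapshot (illustrative sketch).

    CPU figures come from the standard library; GPU figures are parsed from
    `nvidia-smi` output when the tool is available, and omitted otherwise.
    """
    snapshot = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_avg_1m": os.getloadavg()[0],  # 1-minute load average (POSIX)
    }
    if shutil.which("nvidia-smi"):
        # One CSV line per GPU, e.g. "37, 11264"
        out = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=utilization.gpu,memory.used",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True, text=True, check=True,
        ).stdout
        snapshot["gpus"] = [
            dict(zip(("util_pct", "mem_used_mib"), map(int, line.split(", "))))
            for line in out.strip().splitlines()
        ]
    return snapshot
```

Whether such polling belongs in the server itself or should be delegated to tools like Prometheus is exactly the open question this item raises.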