
[Roadmap] Q3 2024

peterschmidt85 opened this issue on Jun 24, 2024 • 0 comments

This issue outlines the major items planned for Q3 2024.

For major bugs, see https://github.com/dstackai/dstack/labels/major

Core features

  • [ ] dstack apply
    • #1571
    • #1472
    • #1671
    • #1223
    • #1326
    • #1327
  • [x] Production-grade deployment of the server
    • #1577
    • #1632
    • #1393
  • [ ] Gateways
    • #1664
    • #1595
    • #1349
    • #1631
  • [ ] Volumes
    • #1158
    • #1296
    • #1469
    • #1467
  • [ ] Private subnets
    • #1201
    • #1171
    • #1264
    • #1304
    • #1321
    • #1672
  • [ ] Clusters
    • #1337
    • #1327
    • #1489
  • [ ] #708
  • [ ] Reserved instances
    • #1155
  • [x] #1419

Supported architectures

  • [ ] TPU
    • #956
    • #1337
  • [ ] AMD
    • #1413
  • [ ] Gaudi

Examples

[!IMPORTANT] Community help is welcome!

  • [ ] AMD
    • #1598
    • [ ] Axolotl
    • [ ] TRL
  • [ ] Nim
  • [ ] GitHub Actions
  • [ ] vLLM/TGI
  • [ ] Llama 3.2 (multi-modal)
  • [ ] FLUX
  • [x] Ray
  • [x] Spark
  • [ ] Unsloth
  • [ ] Alignment Handbook with Llama 3.1
  • [ ] Nemo
  • [ ] TensorRT-LLM
  • [ ] Triton
  • [ ] Function calling
  • [ ] Llama Index
  • [ ] LangChain
  • [x] TPU
  • [x] Multi-node Alignment Handbook
  • [x] Llama 3.1
  • [x] Fine-tuning Llama 3.1
  • [x] Axolotl

Improvements

  • [ ] Support for H100 in AWS/GCP/Azure #1238 #1240 #1239
  • [ ] Support for H200 in AWS/GCP/Azure once they support it
  • [x] Support for L4 in AWS #1235
  • [x] Troubleshooting guide #1673
  • [ ] Make it easier to add custom backends (implementation and documentation)

Research

[!IMPORTANT] Research and feedback are required!

  • [ ] Metrics: Research whether dstack should collect certain metrics out of the box (at least hardware utilization) or integrate with more enterprise-grade monitoring tools.
  • [ ] Fault-tolerant training: Research how dstack can be used for fault-tolerant training of massive models.

peterschmidt85 • Jun 24 '24 10:06