dlrover
DLRover: An Automatic Distributed Deep Learning System
`num_workers` can significantly affect `DataLoader` performance. Today, users usually adjust `num_workers` by restarting the job. If DLRover could auto-tune `num_workers` without interrupting the training,...
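A minimal sketch of what such an auto-tuner might do, independent of DLRover's actual implementation: benchmark the loader's throughput under several candidate `num_workers` values and keep the fastest. The `make_loader` factory and the candidate list are assumptions for illustration; with PyTorch it would be something like `lambda n: DataLoader(dataset, num_workers=n)`.

```python
import time

def measure_throughput(loader, num_batches=50):
    """Consume num_batches batches and return batches per second."""
    start = time.perf_counter()
    for i, _ in enumerate(loader):
        if i + 1 >= num_batches:
            break
    elapsed = time.perf_counter() - start
    return num_batches / elapsed if elapsed > 0 else float("inf")

def autotune_num_workers(make_loader, candidates=(0, 2, 4, 8), num_batches=50):
    """Try each candidate num_workers value and return the fastest one.

    `make_loader` is a user-supplied factory (hypothetical interface), e.g.
    lambda n: DataLoader(dataset, num_workers=n).
    """
    best_workers, best_rate = None, -1.0
    for n in candidates:
        rate = measure_throughput(make_loader(n), num_batches)
        if rate > best_rate:
            best_workers, best_rate = n, rate
    return best_workers
```

Doing this online, as the issue asks, would additionally require rebuilding the `DataLoader` between epochs rather than between jobs.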
We need to support fine-tuning ChatGLM (a popular Chinese LLM). This would be a best-practice example for users optimizing fine-tuning jobs with dlrover.
We need to support Stanford Alpaca fine-tuning. This would be a best-practice example for users optimizing fine-tuning jobs with dlrover.
This issue aims to implement the capability to monitor the synchronization time during the Allreduce process in distributed training. Monitoring Allreduce time is critical to understanding the efficiency of our...
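One plausible shape for such a monitor, sketched in pure Python rather than against DLRover's internals: a small recorder whose context manager wraps each collective call and accumulates wall-clock durations. In a real job the body of `record()` would wrap `torch.distributed.all_reduce(tensor)`; the class name and summary fields here are illustrative assumptions.

```python
import time
from contextlib import contextmanager

class AllreduceMonitor:
    """Records the wall-clock duration of each allreduce call it wraps.

    Hypothetical helper: in real training code you would write
        with monitor.record():
            dist.all_reduce(grad_tensor)
    """
    def __init__(self):
        self.durations = []

    @contextmanager
    def record(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations.append(time.perf_counter() - start)

    def summary(self):
        """Return count, total and mean synchronization time in seconds."""
        if not self.durations:
            return {"count": 0, "total": 0.0, "mean": 0.0}
        total = sum(self.durations)
        return {"count": len(self.durations),
                "total": total,
                "mean": total / len(self.durations)}
```

Note that wall-clock time around an allreduce also includes straggler wait time, which is often exactly the signal one wants when diagnosing synchronization overhead.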
[Llama](https://ai.meta.com/llama/) is one of the most popular open-source LLM base models. Try to support Llama training or fine-tuning. [BabyLlama](https://github.com/karpathy/llama2.c) should be a good starting example.
Dear all, I think we need to make our system much easier to use for beginners; for example, users could start their fine-tuning in a simpler...
For this issue, the objective is to create a hook or callback system in our PyTorch trainer that would allow it to invoke resource monitoring and time reporting at the...
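A hedged sketch of what such a hook system could look like, using hypothetical names rather than DLRover's actual trainer API: a base callback class with per-step hooks, and a trainer that invokes every registered callback around each training step. A resource monitor or time reporter would then just be another callback.

```python
class TrainerCallback:
    """Base class for trainer hooks; subclasses override what they need."""
    def on_step_begin(self, step):
        pass

    def on_step_end(self, step, metrics):
        pass

class ResourceReportCallback(TrainerCallback):
    """Example callback: collects per-step metrics for later reporting."""
    def __init__(self):
        self.reports = []

    def on_step_end(self, step, metrics):
        self.reports.append((step, metrics))

class Trainer:
    """Toy trainer that fires callbacks around each (simulated) step."""
    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

    def train(self, num_steps):
        for step in range(num_steps):
            for cb in self.callbacks:
                cb.on_step_begin(step)
            # Stand-in for a real forward/backward/optimizer step.
            metrics = {"loss": 1.0 / (step + 1)}
            for cb in self.callbacks:
                cb.on_step_end(step, metrics)
```

The appeal of the callback design is that monitoring stays fully decoupled from the training loop: the trainer never needs to know what a callback does with the timings.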
Make FSDP auto-tunable. FSDP exposes many knobs that users can tune today, for both scaling and performance.
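One way an FSDP auto-tuner could be framed, sketched abstractly: enumerate a grid of knob combinations and pick the best under a user-supplied benchmark. The knob names below (sharding strategy, backward prefetch, limiting all-gathers) correspond to real FSDP options, but the grid values and the `score_fn` interface are assumptions for illustration; a real tuner would benchmark actual training steps rather than exhaustively sweep.

```python
from itertools import product

# Illustrative knob grid; real FSDP tuning would also cover the
# auto-wrap policy, mixed precision, and CPU offload settings.
KNOB_GRID = {
    "sharding_strategy": ["FULL_SHARD", "SHARD_GRAD_OP", "NO_SHARD"],
    "backward_prefetch": [True, False],
    "limit_all_gathers": [True, False],
}

def candidate_configs(grid=KNOB_GRID):
    """Yield every knob combination an auto-tuner could benchmark."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def autotune(score_fn, grid=KNOB_GRID):
    """Return the config with the highest score under a user benchmark.

    `score_fn` maps a config dict to a number, e.g. measured throughput.
    """
    return max(candidate_configs(grid), key=score_fn)
```

Even this toy grid has 12 combinations, which is why automating the search matters once real benchmarking cost enters the picture.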