
DLRover: An Automatic Distributed Deep Learning System

50 dlrover issues

`num_workers` can affect the performance of `DataLoader` significantly. Users usually need to adjust `num_workers` by restarting a new job. If DLRover can auto-tune `num_workers` without interrupting the training, ...
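
A rough sketch of the idea: rebuild the `DataLoader` with a new `num_workers` value between epochs instead of restarting the whole job. The `choose_num_workers` heuristic below is purely hypothetical and not DLRover's actual tuning policy.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def choose_num_workers(epoch_seconds, current_workers, max_workers=8):
    # Naive illustrative heuristic: add a worker while the epoch is slow
    # and there is still headroom.
    if epoch_seconds > 60 and current_workers < max_workers:
        return current_workers + 1
    return current_workers


dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
num_workers = 0

for epoch in range(3):
    # Rebuilding the DataLoader is cheap; the training process keeps running.
    loader = DataLoader(dataset, batch_size=32, num_workers=num_workers)
    start = time.time()
    for batch, labels in loader:
        pass  # the training step would go here
    num_workers = choose_num_workers(time.time() - start, num_workers)
```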

We need to support ChatGLM (a leading open-source Chinese LLM) fine-tuning. This would be a best practice for users to optimize fine-tuning jobs with dlrover.

example

We need to support Stanford Alpaca fine-tuning. This would be a best practice for users to optimize fine-tuning jobs with dlrover.

example

This issue aims to implement the capability to monitor the synchronization time during the Allreduce process in distributed training. Monitoring Allreduce time is critical to understanding the efficiency of our...

enhancement
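
As a starting point, one way to time a single Allreduce call with the public `torch.distributed` API is sketched below (single process, `gloo` backend, purely illustrative; in a real job the rank, world size, and master address come from the launcher, and with NCCL on GPUs you would also need `torch.cuda.synchronize()` before reading the clock):

```python
import os
import time

import torch
import torch.distributed as dist

# Single-process setup for illustration only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

tensor = torch.randn(1024 * 1024)
start = time.time()
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
elapsed = time.time() - start
print(f"allreduce of {tensor.numel()} floats took {elapsed * 1000:.2f} ms")

dist.destroy_process_group()
```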

[Llama](https://ai.meta.com/llama/) is one of the most popular open-source LLM base models. Try to support Llama training or fine-tuning. [BabyLlama](https://github.com/karpathy/llama2.c) should be a good starter example.

enhancement
example

Dear all, I think we need to make our system much easier to use for beginners; for example, users could start their fine-tuning etc. in a simpler...

good first issue

For this issue, the objective is to create a hook or callback system in our PyTorch trainer that would allow it to invoke resource monitoring and time reporting at the...

enhancement
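
A minimal sketch of what such a callback interface could look like (the names `TrainerCallback`, `on_step_begin`, and `on_step_end` are hypothetical, not the existing dlrover trainer API):

```python
import time


class TrainerCallback:
    """Hypothetical callback interface; method names are illustrative."""

    def on_step_begin(self, step): ...

    def on_step_end(self, step): ...


class StepTimeReporter(TrainerCallback):
    # A resource-monitoring callback could be plugged in the same way.
    def on_step_begin(self, step):
        self._start = time.time()

    def on_step_end(self, step):
        print(f"step {step} took {time.time() - self._start:.3f}s")


class Trainer:
    def __init__(self, callbacks=None):
        self.callbacks = callbacks or []

    def train(self, num_steps):
        for step in range(num_steps):
            for cb in self.callbacks:
                cb.on_step_begin(step)
            time.sleep(0.01)  # stand-in for forward/backward/optimizer step
            for cb in self.callbacks:
                cb.on_step_end(step)


Trainer(callbacks=[StepTimeReporter()]).train(3)
```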

Support FSDP auto-tuning. There are many knobs that users can tune today with FSDP for both scaling and performance; a few of them are sketched below.
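
For reference, here is a manual configuration sketch of some of those knobs on the public PyTorch FSDP API. It assumes an already-initialized process group (e.g. via `torchrun`) and says nothing about how DLRover would search this space:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    BackwardPrefetch,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)


def wrap_model(model: nn.Module) -> FSDP:
    # Each argument below is a tuning knob an auto-tuner could explore.
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,    # vs. SHARD_GRAD_OP / NO_SHARD
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # overlap comm with backward
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
        limit_all_gathers=True,  # throttle all-gathers to cap peak memory
    )
```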