
DLRover: An Automatic Distributed Deep Learning System

50 dlrover issues

`num_workers` can affect the performance of `DataLoader` significantly. Users usually need to adjust `num_workers` by restarting a new job. If DLRover can auto-tune `num_workers` without interrupting the training, ...
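
A rough sketch of the idea: rebuild the `DataLoader` with a new `num_workers` value between epochs instead of restarting the whole job. The `choose_num_workers` heuristic below is purely hypothetical and not DLRover's actual tuning policy.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def choose_num_workers(epoch_seconds, current_workers, max_workers=8):
    # Naive illustrative heuristic: add a worker while the epoch is slow
    # and there is still headroom.
    if epoch_seconds > 60 and current_workers < max_workers:
        return current_workers + 1
    return current_workers


dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
num_workers = 0

for epoch in range(3):
    # Rebuilding the DataLoader is cheap; the training process keeps running.
    loader = DataLoader(dataset, batch_size=32, num_workers=num_workers)
    start = time.time()
    for batch, labels in loader:
        pass  # the training step would go here
    num_workers = choose_num_workers(time.time() - start, num_workers)
```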

We need to support ChatGLM (a leading open-source Chinese LLM) fine-tuning. This would be a best practice for users to optimize fine-tuning jobs with dlrover.

example

We need to support Stanford Alpaca fine-tuning. This would be a best practice for users to optimize fine-tuning jobs with dlrover.

example

This issue aims to implement the capability to monitor the synchronization time during the Allreduce process in distributed training. Monitoring Allreduce time is critical to understanding the efficiency of our...

enhancement
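
As a starting point, one way to time a single Allreduce call with the public `torch.distributed` API is sketched below (single process, `gloo` backend, purely illustrative; in a real job the rank, world size, and master address come from the launcher, and with NCCL on GPUs you would also need `torch.cuda.synchronize()` before reading the clock):

```python
import os
import time

import torch
import torch.distributed as dist

# Single-process setup for illustration only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

tensor = torch.randn(1024 * 1024)
start = time.time()
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
elapsed = time.time() - start
print(f"allreduce of {tensor.numel()} floats took {elapsed * 1000:.2f} ms")

dist.destroy_process_group()
```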

[Llama](https://ai.meta.com/llama/) is one of the most popular open-source LLM base models. Try to support Llama training or fine-tuning. [BabyLlama](https://github.com/karpathy/llama2.c) should be a good starter example.

enhancement
example

Dear all, I think we need to make our system much easier to use for beginners; for example, users could start their fine-tuning etc. in a simpler...

good first issue

For this issue, the objective is to create a hook or callback system in our PyTorch trainer that would allow it to invoke resource monitoring and time reporting at the...

enhancement
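
A minimal sketch of what such a callback interface could look like (the names `TrainerCallback`, `on_step_begin`, and `on_step_end` are hypothetical, not the existing dlrover trainer API):

```python
import time


class TrainerCallback:
    """Hypothetical callback interface; method names are illustrative."""

    def on_step_begin(self, step): ...

    def on_step_end(self, step): ...


class StepTimeReporter(TrainerCallback):
    # A resource-monitoring callback could be plugged in the same way.
    def on_step_begin(self, step):
        self._start = time.time()

    def on_step_end(self, step):
        print(f"step {step} took {time.time() - self._start:.3f}s")


class Trainer:
    def __init__(self, callbacks=None):
        self.callbacks = callbacks or []

    def train(self, num_steps):
        for step in range(num_steps):
            for cb in self.callbacks:
                cb.on_step_begin(step)
            time.sleep(0.01)  # stand-in for forward/backward/optimizer step
            for cb in self.callbacks:
                cb.on_step_end(step)


Trainer(callbacks=[StepTimeReporter()]).train(3)
```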

Support FSDP auto-tuning. There are many knobs that users can tune today with FSDP for both scaling and performance; a few of them are sketched below.
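
For reference, here is a manual configuration sketch of some of those knobs on the public PyTorch FSDP API. It assumes an already-initialized process group (e.g. via `torchrun`) and says nothing about how DLRover would search this space:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    BackwardPrefetch,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)


def wrap_model(model: nn.Module) -> FSDP:
    # Each argument below is a tuning knob an auto-tuner could explore.
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,    # vs. SHARD_GRAD_OP / NO_SHARD
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # overlap comm with backward
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
        limit_all_gathers=True,  # throttle all-gathers to cap peak memory
    )
```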