There are potential OOM issues in `simple_strategy_generator` during multiple rounds of tuning. We now set a [threshold](https://github.com/intelligent-machine-learning/dlrover/pull/686/files/0c12ea0332564ee54c68737bfb6a88cbcbded0d5#diff-3c1877daf4263e9ad41cc0afa1be975f07da926c9b87eed9a06bc4a479bf8c0a) (e.g. 2400MB) to avoid this problem.
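A minimal sketch of what such a memory-headroom guard could look like; `MEM_THRESHOLD_MB` and `should_continue_tuning` are illustrative names, not the actual implementation in the PR:

```python
# Sketch (not DLRover's actual code): stop further tuning rounds when the
# projected memory usage would leave less than a fixed headroom free.
MEM_THRESHOLD_MB = 2400  # headroom kept free, matching the 2400MB mentioned above


def should_continue_tuning(current_used_mb: float,
                           projected_increase_mb: float,
                           total_mb: float) -> bool:
    """Return True only if the next round keeps at least MEM_THRESHOLD_MB free."""
    projected_used = current_used_mb + projected_increase_mb
    return total_mb - projected_used >= MEM_THRESHOLD_MB


# Example: 12000MB used, next round adds ~2000MB on a 16000MB node -> stop tuning.
print(should_continue_tuning(12000, 2000, 16000))  # False
```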
Currently we just calculate the `updated batch size` straightforwardly from the `original batch size` in the [strategy generator](https://github.com/intelligent-machine-learning/dlrover/pull/686/files#diff-3c1877daf4263e9ad41cc0afa1be975f07da926c9b87eed9a06bc4a479bf8c0a). However, many libraries are optimized for batch sizes that are powers of 2 (e.g. 2, 4, 8, 16, 32)...
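One possible way to align the generated batch size with those libraries is to round it down to the nearest power of two; a small sketch with an illustrative helper name:

```python
# Sketch: snap an updated batch size to the largest power of two not above it,
# since many kernels/libraries are tuned for power-of-two batch sizes.
def round_batch_size_to_power_of_two(batch_size: int) -> int:
    if batch_size < 1:
        return 1
    return 1 << (batch_size.bit_length() - 1)  # largest power of two <= batch_size


for bs in (3, 17, 32, 100):
    print(bs, "->", round_batch_size_to_power_of_two(bs))
# 3 -> 2, 17 -> 16, 32 -> 32, 100 -> 64
```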
We need to provide support for Jupyter notebooks, as they are the preferred environment for data scientists to run training jobs. Jupyter notebooks offer a user-friendly interface and interactive programming...
We need to support ChatGLM (a leading open-source Chinese LLM) fine-tuning. This would serve as a best practice for users optimizing fine-tuning jobs with DLRover.
We need to support Stanford Alpaca fine-tuning. This would serve as a best practice for users optimizing fine-tuning jobs with DLRover.
This issue aims to implement the capability to monitor the synchronization time during the Allreduce process in distributed training. Monitoring Allreduce time is critical to understanding the efficiency of our...
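A rough sketch, assuming a `torch.distributed` setup, of how a single all-reduce could be timed before reporting the measurement; the `timed_allreduce` helper is illustrative, not an existing DLRover API:

```python
# Sketch: measure the wall-clock time of one all-reduce. In a real trainer the
# measurement would be aggregated and reported for monitoring.
import time

import torch
import torch.distributed as dist


def timed_allreduce(tensor: torch.Tensor) -> float:
    """Run all_reduce on `tensor` and return the elapsed seconds."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure prior kernels finished
    start = time.perf_counter()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the collective to complete
    return time.perf_counter() - start


# Usage (after dist.init_process_group(...) has been called):
# grad = torch.ones(1024, device="cuda")
# print(f"allreduce took {timed_allreduce(grad):.4f}s")
```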
[Llama](https://ai.meta.com/llama/) is the most popular open-source LLM base model. Try to support Llama training or fine-tuning. [BabyLlama](https://github.com/karpathy/llama2.c) should be a good starting example.
For this issue, the objective is to create a hook or callback system in our PyTorch trainer that would allow it to invoke resource monitoring and time reporting at the...
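A minimal sketch of what such a hook could look like; the class and method names (`StepCallback`, `on_step_end`) are illustrative, not an existing DLRover API:

```python
# Sketch: a callback the trainer invokes at the end of each step to trigger
# resource monitoring and step-time reporting.
import time
from typing import List


class StepCallback:
    def on_step_end(self, step: int, step_time: float) -> None:
        raise NotImplementedError


class ResourceReportCallback(StepCallback):
    def on_step_end(self, step: int, step_time: float) -> None:
        # Here the trainer could report CPU/GPU usage and step time to the master.
        print(f"step={step} step_time={step_time:.3f}s")


def train_loop(steps: int, callbacks: List[StepCallback]) -> None:
    for step in range(steps):
        start = time.perf_counter()
        # ... forward/backward/optimizer.step() would run here ...
        elapsed = time.perf_counter() - start
        for cb in callbacks:
            cb.on_step_end(step, elapsed)


train_loop(3, [ResourceReportCallback()])
```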
## Description

Currently, DeepSpeed offers `--bind_cores_to_rank` and `--bind_core_list` flags to bind CPU cores, but these require explicit specification from the user. While core binding works, it is not fully automated...
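A minimal sketch of how a per-rank core list could be derived automatically so users would not need to pass `--bind_core_list` by hand; the `cores_for_rank` helper is hypothetical (not a DeepSpeed or DLRover API) and relies on the Linux-only `os.sched_getaffinity`:

```python
# Sketch: split the CPU cores visible to this process evenly across local ranks.
import os
from typing import List


def cores_for_rank(local_rank: int, local_world_size: int) -> List[int]:
    """Return the slice of CPU cores assigned to this local rank."""
    cores = sorted(os.sched_getaffinity(0))  # cores this process may run on (Linux)
    per_rank = max(1, len(cores) // local_world_size)
    start = local_rank * per_rank
    return cores[start:start + per_rank]


# Example: on a 16-core box, local rank 1 of 4 would get cores 4..7.
# os.sched_setaffinity(0, cores_for_rank(1, 4))  # apply the binding
print(cores_for_rank(1, 4))
```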