Tingfeng Lan

Results 9 issues of Tingfeng Lan

There are potential OOM issues in `simple_strategy_generator` during multi-round tuning. We now set a [threshold](https://github.com/intelligent-machine-learning/dlrover/pull/686/files/0c12ea0332564ee54c68737bfb6a88cbcbded0d5#diff-3c1877daf4263e9ad41cc0afa1be975f07da926c9b87eed9a06bc4a479bf8c0a) (e.g. 2400 MB) to avoid this problem.

Currently, the [strategy generator](https://github.com/intelligent-machine-learning/dlrover/pull/686/files#diff-3c1877daf4263e9ad41cc0afa1be975f07da926c9b87eed9a06bc4a479bf8c0a) computes the `updated batch size` directly from the `original batch size`. Many libraries are optimized for batch sizes that are powers of 2 (e.g. 2, 4, 8, 16, 32)...

good first issue
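One way the power-of-2 idea could be applied is to round the computed batch size down to the nearest power of 2. The sketch below is only an illustration: `round_down_to_power_of_two` is a hypothetical helper, not a function from dlrover, and rounding down (rather than up) is an assumption chosen so the result never exceeds the value the generator already computed.

```python
def round_down_to_power_of_two(batch_size: int) -> int:
    """Return the largest power of 2 that is <= batch_size.

    Hypothetical helper for illustration; not part of dlrover.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # bit_length() gives the position of the highest set bit, so shifting
    # 1 by (bit_length - 1) keeps only that bit.
    return 1 << (batch_size.bit_length() - 1)


# Rounding down keeps the tuned value within whatever memory budget
# produced the original number.
for candidate in (1, 6, 32, 50):
    print(candidate, "->", round_down_to_power_of_two(candidate))
```

Rounding up would risk pushing the batch past the memory estimate, which is why this sketch takes the conservative direction.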

We need to provide support for Jupyter notebooks, as they are the preferred environment for data scientists to conduct training jobs. Jupyter notebooks offer a user-friendly interface and interactive programming...

enhancement

We need to support fine-tuning ChatGLM (the best Chinese LLM). This would serve as a best-practice example for users optimizing fine-tuning jobs with dlrover.

example

We need to support fine-tuning Stanford Alpaca. This would serve as a best-practice example for users optimizing fine-tuning jobs with dlrover.

example

This issue aims to implement the capability to monitor the synchronization time during the Allreduce process in distributed training. Monitoring Allreduce time is critical to understanding the efficiency of our...

enhancement
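The Allreduce timing idea can be sketched as a small timer that wraps each collective call and aggregates the durations. Everything here is an assumption for illustration: `SyncTimer` and `fake_allreduce` are hypothetical names, and the stand-in function just sleeps. In a real PyTorch job the wrapped call would be something like `torch.distributed.all_reduce`, and on GPU the device would need to be synchronized before reading the clock because CUDA work is launched asynchronously.

```python
import time
from contextlib import contextmanager
from statistics import mean


class SyncTimer:
    """Collects wall-clock durations of named communication sections."""

    def __init__(self):
        self.records = {}

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped call raises.
            elapsed = time.perf_counter() - start
            self.records.setdefault(name, []).append(elapsed)

    def summary(self):
        return {
            name: {"count": len(v), "mean_s": mean(v), "max_s": max(v)}
            for name, v in self.records.items()
        }


def fake_allreduce():
    # Hypothetical stand-in for the real collective call.
    time.sleep(0.01)


timer = SyncTimer()
for _ in range(3):
    with timer.section("allreduce"):
        fake_allreduce()
print(timer.summary())
```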

[Llama](https://ai.meta.com/llama/) is the most popular open-source LLM base model. We should try to support Llama training or fine-tuning. [BabyLlama](https://github.com/karpathy/llama2.c) should be a good starter example.

enhancement
example

For this issue, the objective is to create a hook or callback system in our PyTorch trainer that would allow it to invoke resource monitoring and time reporting at the...

enhancement
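A hook system of the kind described could look roughly like the following. This is a framework-free sketch, not the dlrover trainer: `HookedTrainer`, the event names, and `register_hook` are all hypothetical, and the training step itself is left as a comment.

```python
from collections import defaultdict


class HookedTrainer:
    """Toy trainer that fires user-registered callbacks around each step."""

    EVENTS = ("on_step_begin", "on_step_end")

    def __init__(self):
        self._hooks = defaultdict(list)

    def register_hook(self, event, fn):
        if event not in self.EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._hooks[event].append(fn)

    def _fire(self, event, **context):
        # Callbacks run in registration order and receive keyword context.
        for fn in self._hooks[event]:
            fn(**context)

    def train(self, num_steps):
        for step in range(num_steps):
            self._fire("on_step_begin", step=step)
            # The forward/backward/optimizer step would go here.
            self._fire("on_step_end", step=step)


reported = []
trainer = HookedTrainer()
# A resource monitor or timing reporter would be registered the same way.
trainer.register_hook("on_step_end", lambda step: reported.append(step))
trainer.train(num_steps=3)
print(reported)
```

Keeping monitoring in callbacks means the training loop does not need to change when a new reporter is added.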

## Description

Currently, DeepSpeed offers `--bind_cores_to_rank` and `--bind_core_list` flags to bind CPU cores, but these require explicit specification from the user. While core binding works, it is not fully automated...

enhancement
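As a rough illustration of what automated binding might involve, the sketch below splits the available cores evenly across local ranks and pins the current process to its slice. It is not DeepSpeed's implementation: `cores_for_rank` and `bind_current_process` are hypothetical helpers, the even contiguous split is an assumption that ignores NUMA topology, and `os.sched_getaffinity` / `os.sched_setaffinity` are available only on Linux.

```python
import os


def cores_for_rank(local_rank, local_world_size, cores):
    """Split `cores` into contiguous, near-equal slices, one per local rank."""
    if not 0 <= local_rank < local_world_size:
        raise ValueError("local_rank out of range")
    cores = sorted(cores)
    base, extra = divmod(len(cores), local_world_size)
    # The first `extra` ranks each take one additional core.
    start = local_rank * base + min(local_rank, extra)
    end = start + base + (1 if local_rank < extra else 0)
    return cores[start:end]


def bind_current_process(local_rank, local_world_size):
    """Pin the calling process to its slice of cores (Linux only)."""
    available = os.sched_getaffinity(0)
    chosen = cores_for_rank(local_rank, local_world_size, available)
    if chosen:
        os.sched_setaffinity(0, chosen)
    return chosen


# Pure computation only; nothing is actually pinned here.
print(cores_for_rank(0, 2, range(8)))
print(cores_for_rank(1, 2, range(8)))
```

A launcher could call `bind_current_process` once per worker at startup, so the user would not have to pass a core list by hand.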