There are potential OOM issues in `simple_strategy_generator` during multiple rounds of tuning. We now set a [threshold](https://github.com/intelligent-machine-learning/dlrover/pull/686/files/0c12ea0332564ee54c68737bfb6a88cbcbded0d5#diff-3c1877daf4263e9ad41cc0afa1be975f07da926c9b87eed9a06bc4a479bf8c0a) (e.g. 2400MB) to avoid this problem.
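A minimal sketch of what such a memory-headroom guard could look like; `MEM_THRESHOLD_MB` and `should_continue_tuning` are illustrative names, not the actual implementation in the PR:

```python
# Sketch (not DLRover's actual code): stop further tuning rounds when the
# projected memory usage would leave less than a fixed headroom free.
MEM_THRESHOLD_MB = 2400  # headroom kept free, matching the 2400MB mentioned above


def should_continue_tuning(current_used_mb: float,
                           projected_increase_mb: float,
                           total_mb: float) -> bool:
    """Return True only if the next round keeps at least MEM_THRESHOLD_MB free."""
    projected_used = current_used_mb + projected_increase_mb
    return total_mb - projected_used >= MEM_THRESHOLD_MB


# Example: 12000MB used, next round adds ~2000MB on a 16000MB node -> stop tuning.
print(should_continue_tuning(12000, 2000, 16000))  # False
```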
Currently we just calculate the `updated batch size` straightforwardly from the `original batch size` in the [strategy generator](https://github.com/intelligent-machine-learning/dlrover/pull/686/files#diff-3c1877daf4263e9ad41cc0afa1be975f07da926c9b87eed9a06bc4a479bf8c0a). However, many libraries are optimized for batch sizes that are powers of 2 (e.g. 2, 4, 8, 16, 32)...
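One possible way to align the generated batch size with those libraries is to round it down to the nearest power of two; a small sketch with an illustrative helper name:

```python
# Sketch: snap an updated batch size to the largest power of two not above it,
# since many kernels/libraries are tuned for power-of-two batch sizes.
def round_batch_size_to_power_of_two(batch_size: int) -> int:
    if batch_size < 1:
        return 1
    return 1 << (batch_size.bit_length() - 1)  # largest power of two <= batch_size


for bs in (3, 17, 32, 100):
    print(bs, "->", round_batch_size_to_power_of_two(bs))
# 3 -> 2, 17 -> 16, 32 -> 32, 100 -> 64
```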
We need to provide support for Jupyter notebooks, as they are the preferred environment for data scientists to run training jobs. Jupyter notebooks offer a user-friendly interface and interactive programming...
We need to support ChatGLM (a leading open-source Chinese LLM) fine-tuning. This would serve as a best practice for users optimizing fine-tuning jobs with DLRover.
We need to support Stanford Alpaca fine-tuning. This would serve as a best practice for users optimizing fine-tuning jobs with DLRover.
This issue aims to implement the capability to monitor the synchronization time during the Allreduce process in distributed training. Monitoring Allreduce time is critical to understanding the efficiency of our...
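A rough sketch, assuming a `torch.distributed` setup, of how a single all-reduce could be timed before reporting the measurement; the `timed_allreduce` helper is illustrative, not an existing DLRover API:

```python
# Sketch: measure the wall-clock time of one all-reduce. In a real trainer the
# measurement would be aggregated and reported for monitoring.
import time

import torch
import torch.distributed as dist


def timed_allreduce(tensor: torch.Tensor) -> float:
    """Run all_reduce on `tensor` and return the elapsed seconds."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure prior kernels finished
    start = time.perf_counter()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the collective to complete
    return time.perf_counter() - start


# Usage (after dist.init_process_group(...) has been called):
# grad = torch.ones(1024, device="cuda")
# print(f"allreduce took {timed_allreduce(grad):.4f}s")
```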
[Llama](https://ai.meta.com/llama/) is the most popular open-source LLM base model. Try to support Llama training or fine-tuning. [BabyLlama](https://github.com/karpathy/llama2.c) should be a good starting example.
For this issue, the objective is to create a hook or callback system in our PyTorch trainer that would allow it to invoke resource monitoring and time reporting at the...
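A minimal sketch of what such a hook could look like; the class and method names (`StepCallback`, `on_step_end`) are illustrative, not an existing DLRover API:

```python
# Sketch: a callback the trainer invokes at the end of each step to trigger
# resource monitoring and step-time reporting.
import time
from typing import List


class StepCallback:
    def on_step_end(self, step: int, step_time: float) -> None:
        raise NotImplementedError


class ResourceReportCallback(StepCallback):
    def on_step_end(self, step: int, step_time: float) -> None:
        # Here the trainer could report CPU/GPU usage and step time to the master.
        print(f"step={step} step_time={step_time:.3f}s")


def train_loop(steps: int, callbacks: List[StepCallback]) -> None:
    for step in range(steps):
        start = time.perf_counter()
        # ... forward/backward/optimizer.step() would run here ...
        elapsed = time.perf_counter() - start
        for cb in callbacks:
            cb.on_step_end(step, elapsed)


train_loop(3, [ResourceReportCallback()])
```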
## Description

Currently, DeepSpeed offers `--bind_cores_to_rank` and `--bind_core_list` flags to bind CPU cores, but these require explicit specification from the user. While core binding works, it is not fully automated...
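A minimal sketch of how a per-rank core list could be derived automatically so users would not need to pass `--bind_core_list` by hand; the `cores_for_rank` helper is hypothetical (not a DeepSpeed or DLRover API) and relies on the Linux-only `os.sched_getaffinity`:

```python
# Sketch: split the CPU cores visible to this process evenly across local ranks.
import os
from typing import List


def cores_for_rank(local_rank: int, local_world_size: int) -> List[int]:
    """Return the slice of CPU cores assigned to this local rank."""
    cores = sorted(os.sched_getaffinity(0))  # cores this process may run on (Linux)
    per_rank = max(1, len(cores) // local_world_size)
    start = local_rank * per_rank
    return cores[start:start + per_rank]


# Example: on a 16-core box, local rank 1 of 4 would get cores 4..7.
# os.sched_setaffinity(0, cores_for_rank(1, 4))  # apply the binding
print(cores_for_rank(1, 4))
```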