SuperRobin
Is there an example of how to use hybrid-sp in Megatron-LM?
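I am not aware of a single documented "hybrid-sp" recipe to point to. As a hedged sketch, assuming "hybrid-sp" means combining Megatron-LM's tensor-parallel sequence parallelism (`--sequence-parallel`) with context parallelism (`--context-parallel-size`), a launch might look like the following; the parallel sizes, sequence length, and trailing `...` model/data arguments are illustrative placeholders, not a verified configuration:

```shell
# Hedged sketch, not an official example: a single-node launch that
# shards LayerNorm/dropout activations over the TP group
# (--sequence-parallel) while also splitting the sequence dimension
# across ranks (--context-parallel-size). All sizes are illustrative.
torchrun --nproc_per_node 8 pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --sequence-parallel \
    --context-parallel-size 4 \
    --seq-length 8192 \
    --micro-batch-size 1 \
    ...
```

Note that `--sequence-parallel` requires `--tensor-model-parallel-size` greater than 1, since it partitions activations across the tensor-parallel group.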
# What does this PR do?

DeepSeekV3-671B-BF16 LoRA Finetune

Fixes #6824
Fixes #6829

## Before submitting

- [ ] Did you read the [contributor guideline](https://github.com/hiyouga/LLaMA-Factory/blob/main/.github/CONTRIBUTING.md)?
- [ ] Did you...
### Reminder

- [x] I have read the above rules and searched the existing issues.

### System Info

Environment:
torch-2.6.0
deepspeed-0.16.4

Error: after enabling the unsloth-offload option, backpropagation fails with an error:

### Reproduction

```text Put your message...
When training on multi-image data with long contexts (64-128k), the 3B/7B models randomly exit after a few training episodes, killed by the OOM killer.

Environment:
torch-2.6.0
deepspeed-0.16.4
vllm-0.7.3

Training configuration: 6 nodes, NoHybrid mode

--ref_num_nodes 4 \
--ref_num_gpus_per_node 8 \
--actor_num_nodes 4 \
--actor_num_gpus_per_node 8 \
--colocate_actor_ref \
--vllm_num_engines 4 \
--vllm_tensor_parallel_size 4 \
--vllm_gpu_memory_utilization 0.3 \
--vllm_sync_backend gloo \

Error message:...