InternEvo
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
### Describe the bug

[codespell.log](https://github.com/user-attachments/files/18884037/codespell.log)

### Environment

python 3.10

### Other information

_No response_
### Describe the bug

CPU memory utilization grows during training and eventually causes OOM when the DataLoader's num_workers is greater than 0. The growth is especially pronounced when more datasets are used; this memory growth...
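As a hedged illustration only (the dataset and values below are placeholders, not InternEvo's actual data pipeline), this is roughly the setting in which the growth is reported: resident CPU memory climbs over training steps once the DataLoader spawns worker processes.

```
# Hypothetical reproduction sketch -- not the project's real data pipeline.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.arange(1_000_000))  # stand-in for many packed datasets

loader = DataLoader(
    train_dataset,
    batch_size=8,
    num_workers=4,   # CPU memory reportedly keeps growing once this is > 0
)

for step, batch in enumerate(loader):
    if step > 100:   # with num_workers=0 the resident memory stays flat instead
        break
```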
### Describe the bug

https://github.com/InternLM/InternEvo/blob/24180aa82a2c5b8f506b589beeabf2ec2dbfadc7/internlm/initialize/launch.py#L312 hard-codes an `enable_qkv_fusion` keyword into `config.model`.

### Environment

Skip

### Other information

Skip
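For context, a hypothetical sketch of the pattern the issue points at (the dict below is illustrative, not the actual code at that line): launch-time initialization unconditionally writes `enable_qkv_fusion` into the user-supplied model config, so a value set in the config file can be silently overridden.

```
# Illustrative only -- not the real launch.py code.
model_cfg = dict(num_layers=32, hidden_size=4096)   # what the user's config provides

# Hard-coded injection at launch time: the key is forced regardless of what
# the config file specified, which is what the issue objects to.
model_cfg["enable_qkv_fusion"] = True
print(model_cfg)
```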
### Describe the feature

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
https://arxiv.org/pdf/2412.09856

LinGen is a non-transformer model based on the Mamba2 SSM. We are working with the first author...
### Describe the bug

https://github.com/InternLM/InternEvo/blob/5ad2eb02fb5be2196e505600fef459185070d1e3/internlm/solver/optimizer/hybrid_zero_optim.py#L842

`single_grad_partition_groups.append(flat_fp32_avg_grads)` collects `flat_fp32_avg_grads` for `unscale_and_clip_grad`, but when cpu_offload is enabled, `self._fp32_flat_param_groups_of_current_rank[group_id].grad = flat_fp32_avg_grads.to(device)` moves the grad tensor to CPU. Gradient clipping therefore only acts on the device tensors in `single_grad_partition_groups`, while the cpu...
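A minimal PyTorch sketch of the mismatch being described (variable names mirror the issue; the rest is illustrative, not the optimizer's actual code): `.to()` across devices yields a separate tensor, so an in-place rescale of the collected device tensor never reaches the offloaded copy.

```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Flattened fp32 averaged gradients, as collected into single_grad_partition_groups.
flat_fp32_avg_grads = torch.full((4,), 10.0, device=device)

# With cpu_offload the grad consumed by the optimizer step lives on CPU; copy=True
# forces a distinct tensor even on a CPU-only machine so the sketch still runs.
offloaded_grad = flat_fp32_avg_grads.to("cpu", copy=True)

# unscale_and_clip_grad later rescales the collected *device* tensor in place ...
flat_fp32_avg_grads.mul_(0.1)

# ... but the offloaded copy actually used for the update is unchanged,
# i.e. the clipping has no effect on the gradients that matter.
print(offloaded_grad)  # tensor([10., 10., 10., 10.])
```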
Special note: the technical approach of this module is based on veScale checkpoint and ByteCheckpoint.

veScale: https://github.com/volcengine/veScale/tree/main
ByteCheckpoint: https://arxiv.org/abs/2407.20143

# Universal Checkpoint System

The universal ckpt system is independent of the original ckpt system; the two are not compatible with each other.

## Basic features

Dynamic loading support for dense-model model ckpts and optimizer ckpts across various parallel configurations:

- [x] GPU world size
- [x] tensor parallel
- [x] pipeline parallel...
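As a simplified sketch of the resharding idea behind such a universal checkpoint (this is not the veScale/ByteCheckpoint implementation, and the shapes are made up): a weight saved as tensor-parallel shards under one TP degree is merged back into the full parameter and re-split for the TP degree requested at load time.

```
import torch

# Two tensor-parallel shards of one weight, as saved under TP=2 (shapes are made up).
shard_rank0 = torch.randn(2048, 4096)
shard_rank1 = torch.randn(2048, 4096)

# Merge back to the full parameter, then re-split for the new parallel layout (TP=4).
full_weight = torch.cat([shard_rank0, shard_rank1], dim=0)
new_tp = 4
new_shards = torch.chunk(full_weight, new_tp, dim=0)

assert [s.shape for s in new_shards] == [(1024, 4096)] * new_tp
```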
use bf16 logits for loss:

```
loss = dict(
    label_smoothing=0,
    op_type='flash_vocab_parallel',
)
use_fp32_logits = False
```

By default `use_fp32_logits` is True, so there is no BC-break.
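A hedged sketch of what such a flag typically gates (the function name and exact placement are assumptions, not InternEvo's implementation): logits are upcast to fp32 before the cross-entropy when `use_fp32_logits` is True, and left in bf16 otherwise.

```
import torch
import torch.nn.functional as F

def ce_loss(logits: torch.Tensor, labels: torch.Tensor, use_fp32_logits: bool = True) -> torch.Tensor:
    # Default True keeps the current behaviour (hence no BC-break);
    # False computes the loss directly on the bf16 logits.
    if use_fp32_logits:
        logits = logits.float()
    return F.cross_entropy(logits, labels)

logits = torch.randn(4, 32000, dtype=torch.bfloat16)
labels = torch.randint(0, 32000, (4,))
print(ce_loss(logits, labels))                          # fp32 path (default)
print(ce_loss(logits, labels, use_fp32_logits=False))   # bf16 path
```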