InternEvo
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
### Describe the bug

[codespell.log](https://github.com/user-attachments/files/18884037/codespell.log)

### Environment

python 3.10

### Other information

_No response_
### Describe the bug

CPU memory utilization grows during training and eventually causes OOM when the DataLoader's num_workers is greater than 0. The growth is especially pronounced when more datasets are used; this memory growth...
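As a hedged illustration only (the dataset and values below are placeholders, not InternEvo's actual data pipeline), this is roughly the setting in which the growth is reported: resident CPU memory climbs over training steps once the DataLoader spawns worker processes.

```
# Hypothetical reproduction sketch -- not the project's real data pipeline.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.arange(1_000_000))  # stand-in for many packed datasets

loader = DataLoader(
    train_dataset,
    batch_size=8,
    num_workers=4,   # CPU memory reportedly keeps growing once this is > 0
)

for step, batch in enumerate(loader):
    if step > 100:   # with num_workers=0 the resident memory stays flat instead
        break
```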
### Describe the bug

https://github.com/InternLM/InternEvo/blob/24180aa82a2c5b8f506b589beeabf2ec2dbfadc7/internlm/initialize/launch.py#L312 hard-codes an `enable_qkv_fusion` keyword into `config.model`.

### Environment

Skip

### Other information

Skip
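For context, a hypothetical sketch of the pattern the issue points at (the dict below is illustrative, not the actual code at that line): launch-time initialization unconditionally writes `enable_qkv_fusion` into the user-supplied model config, so a value set in the config file can be silently overridden.

```
# Illustrative only -- not the real launch.py code.
model_cfg = dict(num_layers=32, hidden_size=4096)   # what the user's config provides

# Hard-coded injection at launch time: the key is forced regardless of what
# the config file specified, which is what the issue objects to.
model_cfg["enable_qkv_fusion"] = True
print(model_cfg)
```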
### Describe the feature

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
https://arxiv.org/pdf/2412.09856

LinGen is a non-transformer model based on the Mamba2 SSM. We are working with the first author...
### Describe the bug

https://github.com/InternLM/InternEvo/blob/5ad2eb02fb5be2196e505600fef459185070d1e3/internlm/solver/optimizer/hybrid_zero_optim.py#L842

`single_grad_partition_groups.append(flat_fp32_avg_grads)` collects `flat_fp32_avg_grads` for `unscale_and_clip_grad`, but when cpu_offload is enabled, `self._fp32_flat_param_groups_of_current_rank[group_id].grad = flat_fp32_avg_grads.to(device)` moves the grad tensor to CPU. Gradient clipping therefore only acts on the device tensors in `single_grad_partition_groups`, while the cpu...
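A minimal PyTorch sketch of the mismatch being described (variable names mirror the issue; the rest is illustrative, not the optimizer's actual code): `.to()` across devices yields a separate tensor, so an in-place rescale of the collected device tensor never reaches the offloaded copy.

```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Flattened fp32 averaged gradients, as collected into single_grad_partition_groups.
flat_fp32_avg_grads = torch.full((4,), 10.0, device=device)

# With cpu_offload the grad consumed by the optimizer step lives on CPU; copy=True
# forces a distinct tensor even on a CPU-only machine so the sketch still runs.
offloaded_grad = flat_fp32_avg_grads.to("cpu", copy=True)

# unscale_and_clip_grad later rescales the collected *device* tensor in place ...
flat_fp32_avg_grads.mul_(0.1)

# ... but the offloaded copy actually used for the update is unchanged,
# i.e. the clipping has no effect on the gradients that matter.
print(offloaded_grad)  # tensor([10., 10., 10., 10.])
```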
Special note: the technical approach of this module is based on veScale checkpoint and ByteCheckpoint.

veScale: https://github.com/volcengine/veScale/tree/main
ByteCheckpoint: https://arxiv.org/abs/2407.20143

# Universal Checkpoint System

The universal ckpt system is independent of the original ckpt system; the two are not compatible with each other.

## Basic features

Dynamic loading support for dense-model model ckpts and optimizer ckpts across various parallel configurations:

- [x] GPU world size
- [x] tensor parallel
- [x] pipeline parallel...
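As a simplified sketch of the resharding idea behind such a universal checkpoint (this is not the veScale/ByteCheckpoint implementation, and the shapes are made up): a weight saved as tensor-parallel shards under one TP degree is merged back into the full parameter and re-split for the TP degree requested at load time.

```
import torch

# Two tensor-parallel shards of one weight, as saved under TP=2 (shapes are made up).
shard_rank0 = torch.randn(2048, 4096)
shard_rank1 = torch.randn(2048, 4096)

# Merge back to the full parameter, then re-split for the new parallel layout (TP=4).
full_weight = torch.cat([shard_rank0, shard_rank1], dim=0)
new_tp = 4
new_shards = torch.chunk(full_weight, new_tp, dim=0)

assert [s.shape for s in new_shards] == [(1024, 4096)] * new_tp
```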
use bf16 logits for loss:

```
loss = dict(
    label_smoothing=0,
    op_type='flash_vocab_parallel',
)
use_fp32_logits = False
```

By default `use_fp32_logits` is True, so there is no BC-break.
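A hedged sketch of what such a flag typically gates (the function name and exact placement are assumptions, not InternEvo's implementation): logits are upcast to fp32 before the cross-entropy when `use_fp32_logits` is True, and left in bf16 otherwise.

```
import torch
import torch.nn.functional as F

def ce_loss(logits: torch.Tensor, labels: torch.Tensor, use_fp32_logits: bool = True) -> torch.Tensor:
    # Default True keeps the current behaviour (hence no BC-break);
    # False computes the loss directly on the bf16 logits.
    if use_fp32_logits:
        logits = logits.float()
    return F.cross_entropy(logits, labels)

logits = torch.randn(4, 32000, dtype=torch.bfloat16)
labels = torch.randint(0, 32000, (4,))
print(ce_loss(logits, labels))                          # fp32 path (default)
print(ce_loss(logits, labels, use_fp32_logits=False))   # bf16 path
```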