InternEvo
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
# InternLM Simulator

## 1. Introduction

The solver mainly consists of two components:

1. `profiling`: Collects the time consumption of each stage during the model training process in advance and...
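The description above is truncated, but to illustrate the idea of collecting per-stage time consumption, here is a minimal sketch of a stage timer; the `stage_timer` helper and the stage names are hypothetical illustrations, not InternEvo's actual profiling API:

```python
import time
from contextlib import contextmanager

# Hypothetical accumulator: maps a stage name to its total wall-clock time.
stage_times = {}

@contextmanager
def stage_timer(name):
    """Accumulate the wall-clock time spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_times[name] = stage_times.get(name, 0.0) + time.perf_counter() - start

# Illustrative usage inside a training step:
# with stage_timer("forward"):
#     output = model(batch)
# with stage_timer("backward"):
#     loss.backward()
```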
### Describe the feature

Decouple the ZeRO and parallelism configurations of the dense layers and the expert layers in MoE models.

### Will you implement it?

- [ ] I would like to implement this feature and create a PR!
### Describe the feature

In practice, `memory_pool` is not needed, and the memory pool logic may conflict with the device-memory allocation strategies of other chips. It is recommended to remove the memory pool implementation and all of its uses across the codebase, including MoE's use of the memory pool.

### Will you implement it?

- [ ] I would like to implement this feature and create a PR!
Thanks for your contribution; we appreciate it a lot. The following instructions will help make your pull request healthier and get feedback more easily. If you do not understand...
### Describe the feature

Does InternEvo support fine-tuning using LoRA?

### Will you implement it?

- [ ] I would like to implement this feature and create a PR!
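For context on what LoRA support involves: LoRA freezes a pretrained weight `W` and learns a low-rank update `B @ A`, so the effective weight is `W + (alpha/r) * B @ A`. A minimal PyTorch sketch of the idea (illustrative only; `LoRALinear` is not an InternEvo class):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W + scaling * B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: update starts at zero
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```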
### Describe the bug

It occurs probabilistically: a Socket Timeout is raised during `group.allreduce([tensor], opts)`.

```python
if group in _world.pg_coalesce_state.keys():
    # We are in coalescing context, do not issue single operation, just append...
```
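The snippet is truncated, but since the failure is an intermittent socket timeout inside a collective, one possible mitigation (an assumption, not a confirmed fix for this issue) is to raise the process-group timeout when initializing `torch.distributed`:

```python
from datetime import timedelta

import torch.distributed as dist

# Raise the collective timeout so that transient network stalls are less likely
# to abort the job with a socket timeout. The two-hour value is illustrative.
# Assumes the usual distributed launch environment (MASTER_ADDR, RANK, etc.).
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```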
### Describe the question

Does InternEvo support tied_embedding? If so, how do I use it?
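Whether InternEvo exposes a configuration switch for this is not answered here, but in plain PyTorch, tied embeddings simply share one weight tensor between the input embedding and the output projection; a minimal sketch with illustrative sizes:

```python
import torch.nn as nn

vocab_size, hidden_size = 32000, 4096  # illustrative sizes
embed = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
lm_head.weight = embed.weight  # tie: both layers now share (and update) the same parameter
```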
### Describe the bug

1. The script provided earlier only supports converting weights to HF format for models trained with GShard MoE; if training was done with MegaBlock, the weight-conversion script no longer applies.
2. A script for converting already-trained InternEvo weights into InternEvo MoE weights is still not provided.

### Environment

Official image

### Other information

_No response_
### Describe the bug

Let me restate my problem: I trained with InternEvo using bf16, then converted the checkpoint to HF format and ran inference in fp16, and hit the error below.

```
Traceback (most recent call last):
  File "/InternLM/hf_test.py", line 15, in <module>
    output = model.generate(**inputs, **gen_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File...
```
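The traceback is truncated, but a common cause of this bf16-train / fp16-infer pattern is numeric overflow, since fp16 has a far smaller dynamic range than bf16. A hedged workaround sketch, assuming the converted checkpoint can simply be loaded in bf16 for inference (the model path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/converted_hf_model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # keep bf16 instead of casting to fp16
    trust_remote_code=True,
).eval()
```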