Tron1994 comments

Results: 4 comments of Tron1994

[BUG]: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 2860777) of binary:

> I used the code in https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/train_gpt.py. config.py: from colossalai.nn.optimizer import HybridAdam from colossalai.zero.shard_utils import TensorShardStrategy from titans.model.gpt import gpt2_small, gpt2_36B BATCH_SIZE = 2 NUM_EPOCHS = 60 SEQ_LEN =...

gpt+zero3 test reports an error: RuntimeError: CUDA error: device-side assert triggered

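Diagnosis: Assertion 'srcIndex < srcSelectDimSize' failed. Cause: I used the yuan corpus vocab.txt, whose size differs from the default GPT vocab_size, so the value has to be changed by hand. It can be set in the gpt_zero3.py config: model = dict( type=gpt2_small, vocab_size=53227, checkpoint=True, ). I hope this gets fixed.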
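A mismatch like this can be caught on the CPU before launching training. The snippet below is only a minimal sketch: `count_vocab_entries` and `find_out_of_range_ids` are hypothetical helpers (not part of ColossalAI or titans), and it assumes the vocab file holds one token per line. The idea is that any token id greater than or equal to the configured vocab_size indexes past the end of the embedding table, which is what the `srcIndex < srcSelectDimSize` assertion is complaining about.

```python
from typing import Iterable, List


def count_vocab_entries(lines: Iterable[str]) -> int:
    """Count non-empty lines, assuming one token per line in the vocab file."""
    return sum(1 for line in lines if line.strip())


def find_out_of_range_ids(token_ids: Iterable[int], vocab_size: int) -> List[int]:
    """Return the token ids that an embedding table with `vocab_size` rows cannot index."""
    return [t for t in token_ids if t < 0 or t >= vocab_size]


# Toy data standing in for a real vocab file and a tokenized batch.
toy_vocab = ["[PAD]", "[UNK]", "hello", "world", ""]
toy_batch = [0, 2, 3, 5]

n_tokens = count_vocab_entries(toy_vocab)
bad_ids = find_out_of_range_ids(toy_batch, n_tokens)
print(f"vocab entries: {n_tokens}, out-of-range ids: {bad_ids}")
```

With a real run, the lines would come from something like `open("vocab.txt", encoding="utf-8")`, and the resulting count would be compared against the vocab_size value in the config before starting the job.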

Is there a distillation loss curve available for reference?

I found that using a chat model as the teacher works better than using a base model.

Reproducing the knowledge distillation results

Did you find out what the problem was? I have run many experiments, and my distil_loss also stays essentially flat, decreasing only very slightly. ![1731924431050](https://github.com/user-attachments/assets/9600c7ae-dc23-44c9-a43a-eec39d4111e5) ![image](https://github.com/user-attachments/assets/83d24a09-f1e4-49ff-b0cd-56cd7d033cf3)