Random

Results: 6 comments of Random

We have not adapted NVIDIA's Megatron-LM training framework; currently we support fairseq and transformers. If the model was trained with Megatron-LM, a model conversion is required first.

Does "adapting to distributed training" mean using EET during training? That won't work: EET is an inference engine and does not support backpropagation.

Maybe you called torch.load() without 'map_location=lambda storage, loc: storage'. The original checkpoint saved its tensors on different GPUs, so torch.load() will also create another process to map the...
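A minimal sketch of the fix, assuming a local checkpoint file (the path "checkpoint.pt" here is hypothetical):

```python
import torch

# Hypothetical checkpoint path; substitute your own file.
ckpt_path = "checkpoint.pt"

# Without map_location, torch.load() restores each tensor onto the GPU it
# was saved from; if those devices are unavailable, loading can fail or
# allocate unwanted CUDA contexts. Remapping every storage to CPU avoids that:
state = torch.load(ckpt_path, map_location=lambda storage, loc: storage)

# Equivalent, shorter spelling:
state = torch.load(ckpt_path, map_location="cpu")
```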

I recommend using the Easy and Efficient Transformer (EET) for inference.

@mahnerak I solved this by setting num_workers=0. It seems like a bug in PyTorch!
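For context, a minimal sketch of that workaround; the toy dataset and batch size below are placeholders, not part of the original report:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real one.
dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

# num_workers=0 loads batches in the main process instead of forking
# worker subprocesses, which sidesteps the multiprocessing hang.
loader = DataLoader(dataset, batch_size=32, num_workers=0)

for batch_x, batch_y in loader:
    pass  # training / inference step goes here
```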

@mahnerak, did you solve it?