Random
We have not adapted NVIDIA's Megatron-LM training framework; currently we support fairseq and transformers. If your model was trained with Megatron-LM, you need to convert the model first.
By "adapting distributed training", do you mean using EET during training? That won't work: EET is an inference engine and does not support backpropagation.
Maybe you used torch.load() without 'map_location=lambda storage, loc: storage'. The original checkpoint saved the tensors on different GPUs, so torch.load() will also create another process to map the...
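For reference, a minimal sketch of loading such a checkpoint with map_location so that tensors saved on other GPUs are remapped instead of being restored to their original devices (the checkpoint path is a placeholder):

```python
import torch

# Remap every stored tensor to CPU regardless of the GPU it was saved from,
# so torch.load() does not try to allocate memory on the original devices.
state_dict = torch.load(
    "checkpoint.pt",  # placeholder path
    map_location=lambda storage, loc: storage,
)

# Equivalent, more common form:
# state_dict = torch.load("checkpoint.pt", map_location="cpu")
```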
I recommend using the Easy and Efficient Transformer (EET) for inference.
@mahnerak I solved this by adding num_workers=0. It seems like a bug in PyTorch! A sketch of the workaround is below.
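A minimal sketch of that workaround with a standard torch.utils.data.DataLoader (the dataset here is a placeholder): setting num_workers=0 keeps data loading in the main process and avoids spawning worker processes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute your own.
dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

# num_workers=0 loads batches in the main process, avoiding the
# multiprocessing issue described above.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

for inputs, labels in loader:
    pass  # training / inference step goes here
```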
@mahnerak, did you solve it?