fairseq Can't load NLLB MoE model with torch.load

Open fiqas opened this issue 2 years ago • 0 comments

🐛 Bug

I'm trying to open and inspect the NLLB MoE model (405 GB), but I can't load it with torch.load. The smaller dense models load fine, and I can access their checkpoint parameters.

To Reproduce

Run cmd '....'

>> python3
>> import torch
>> checkpoint = torch.load("nllb200moe54bmodel", map_location=torch.device('cpu'))

See error

  File "/data/user/model_info.py", line 18, in main
    checkpoint = torch.load(args.model, map_location=torch.device('cpu'))
  File "/home/user/.conda/envs/nllb/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/user/.conda/envs/nllb/lib/python3.9/site-packages/torch/serialization.py", line 762, in _legacy_load
    return legacy_load(f)
  File "/home/user/.conda/envs/nllb/lib/python3.9/site-packages/torch/serialization.py", line 687, in legacy_load
    tar.extract('storages', path=tmpdir)
  File "/home/user/.conda/envs/nllb/lib/python3.9/tarfile.py", line 2077, in extract
    tarinfo = self.getmember(member)
  File "/home/user/.conda/envs/nllb/lib/python3.9/tarfile.py", line 1799, in getmember
    raise KeyError("filename %r not found" % name)
KeyError: "filename 'storages' not found"
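The last frames show that tarfile opened the file and only failed when looking up the 'storages' member, which suggests the file parses as a tar archive rather than as a single torch.save checkpoint. A minimal diagnostic sketch to check this without loading any tensors (the helper name describe_checkpoint_file and the max_members limit are made up for illustration; only standard-library calls are used):

```python
import tarfile
import zipfile


def describe_checkpoint_file(path, max_members=10):
    """Report the container format of a file without loading any tensors.

    Hypothetical helper for illustration; it is not part of fairseq or PyTorch.
    """
    # Check tar first: a tar archive whose last member is a zip-based
    # checkpoint could otherwise be mistaken for a zip file.
    if tarfile.is_tarfile(path):
        names = []
        with tarfile.open(path) as tar:
            # Iterate lazily so a very large archive is not fully scanned.
            for member in tar:
                names.append(member.name)
                if len(names) >= max_members:
                    break
        return {"format": "tar", "members": names}
    if zipfile.is_zipfile(path):
        # Zip-based container, the format newer torch.save versions write.
        return {"format": "zip", "members": []}
    # Anything else, e.g. a legacy pickle-based checkpoint.
    return {"format": "other", "members": []}
```

Calling describe_checkpoint_file("nllb200moe54bmodel") and getting "tar" back would mean the members have to be extracted before torch.load can read them. That is an assumption to verify against the actual file, not something the traceback confirms on its own.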

Code sample

import torch
checkpoint = torch.load("nllb200moe54bmodel", map_location=torch.device('cpu'))

Expected behavior

It should load with no error and parameters should be accessible.

Environment

fairseq Version: nllb branch
PyTorch Version: '1.10.1+cu113'
OS (e.g., Linux): Ubuntu 22.04.1 LTS
How you installed fairseq (pip, source): source
Build command you used (if compiling from source):

git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
git checkout nllb
pip install -e .
python setup.py build_ext --inplace

Python version: 3.9.13
CUDA/cuDNN version: 11.6
GPU models and configuration: CPU only just to load the model into memory
Any other relevant information:

Additional context

Dec 01 '22 14:12 fiqas
