Can't load NLLB MoE model with torch.load

Open • fiqas opened this issue 2 years ago • 0 comments

🐛 Bug

I'm trying to open and inspect the NLLB MoE model (405GB), but I can't load it with torch.load. Smaller dense models load fine, and I can access their checkpoint parameters.

To Reproduce

  1. Start python3 and run:
>>> import torch
>>> checkpoint = torch.load("nllb200moe54bmodel", map_location=torch.device('cpu'))
  2. See error:
  File "/data/user/model_info.py", line 18, in main
    checkpoint = torch.load(args.model, map_location=torch.device('cpu'))
  File "/home/user/.conda/envs/nllb/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/user/.conda/envs/nllb/lib/python3.9/site-packages/torch/serialization.py", line 762, in _legacy_load
    return legacy_load(f)
  File "/home/user/.conda/envs/nllb/lib/python3.9/site-packages/torch/serialization.py", line 687, in legacy_load
    tar.extract('storages', path=tmpdir)
  File "/home/user/.conda/envs/nllb/lib/python3.9/tarfile.py", line 2077, in extract
    tarinfo = self.getmember(member)
  File "/home/user/.conda/envs/nllb/lib/python3.9/tarfile.py", line 1799, in getmember
    raise KeyError("filename %r not found" % name)
KeyError: "filename 'storages' not found"
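
A possible reading of the traceback: torch.load fell back to its legacy tar loader, and the failure at tar.extract('storages') suggests the file could be opened as a tar archive but does not have the layout of a legacy PyTorch tar checkpoint. Below is a minimal sketch for checking what kind of container the file is, using only the standard library. The name describe_checkpoint_file is a hypothetical helper for illustration, not part of PyTorch or fairseq.

```python
import tarfile
import zipfile


def describe_checkpoint_file(path, max_members=20):
    """Return (kind, member_names) for a file on disk.

    kind is "tar" for a tar archive, "zip" for a zip archive (the container
    that recent torch.save versions write), or "other" for anything else.
    Hypothetical helper: it only inspects the container, it loads no tensors.
    """
    # Check tar first: a tar whose last member is a zip file can otherwise
    # be misdetected as a zip archive.
    if tarfile.is_tarfile(path):
        names = []
        with tarfile.open(path) as tar:
            for member in tar:  # reads member headers, extracts nothing
                names.append(member.name)
                if len(names) >= max_members:
                    break
        return "tar", names
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return "zip", zf.namelist()[:max_members]
    return "other", []


if __name__ == "__main__":
    # Path taken from the report above; adjust to the local file.
    kind, names = describe_checkpoint_file("nllb200moe54bmodel")
    print(kind)
    for name in names:
        print("  ", name)
```

If this reports "tar" with several member files, the download may be an archive of checkpoint files that has to be unpacked before any single file is passed to torch.load. That is an assumption to verify, not something the traceback proves.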

Code sample

import torch
checkpoint = torch.load("nllb200moe54bmodel", map_location=torch.device('cpu'))

Expected behavior

It should load with no error and parameters should be accessible.
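
As a rough illustration of what "accessible" means here, the sketch below flattens a loaded checkpoint into (name, shape) pairs. It assumes the object returned by torch.load is a nested dict whose tensor leaves expose a shape attribute; summarize_state is a hypothetical helper, and the "model" key in the usage comment is an assumption about the checkpoint layout.

```python
def summarize_state(state, prefix=""):
    """Flatten a nested dict into (name, shape) pairs for tensor-like leaves.

    Any value with a `shape` attribute is treated as a tensor; other
    non-dict values (config objects, counters, and so on) are skipped.
    """
    rows = []
    for key, value in state.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested dicts, joining key names with dots.
            rows.extend(summarize_state(value, prefix=name + "."))
        elif hasattr(value, "shape"):
            rows.append((name, tuple(value.shape)))
    return rows


# Intended use (assumes the weights live under a "model" key):
#   checkpoint = torch.load(path, map_location=torch.device('cpu'))
#   for name, shape in summarize_state(checkpoint["model"]):
#       print(name, shape)
```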

Environment

  • fairseq Version: nllb branch
  • PyTorch Version: '1.10.1+cu113'
  • OS (e.g., Linux): Ubuntu 22.04.1 LTS
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source):
git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
git checkout nllb
pip install -e .
python setup.py build_ext --inplace
  • Python version: 3.9.13
  • CUDA/cuDNN version: 11.6
  • GPU models and configuration: CPU only just to load the model into memory
  • Any other relevant information:

Additional context

fiqas, Dec 01 '22 14:12