
Error with verify option when using convert_hf_checkpoint.py

Open joaopalotti opened this issue 2 years ago • 1 comment

Hi,

First of all, thanks for developing lit-llama; it's a great framework!

I would like to ask about the --verify option of the convert_hf_checkpoint.py script. I am running it with the following command:

python scripts/convert_hf_checkpoint.py --checkpoint checkpoints/open-llama/7B/ --model_size 7B --output_dir checkpoints/lit-llama/7B700bt/ --verify True

The first error I got was regarding the device used:

RuntimeError: Tensor on device cpu is not on the expected device meta!

What is the meta device?

In any case, it is easy to work around by setting the device to cpu, for example. After that, I hit an error on the other assert statement:
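For background, PyTorch's "meta" device stores only a tensor's metadata (shape, dtype) with no backing data, which lets large models be instantiated without allocating memory; a tensor on meta therefore cannot be compared against one holding real values on cpu. A minimal illustration:

```python
import torch

# A meta tensor carries shape and dtype but no actual data.
t = torch.empty(2, 3, device="meta")
print(t.device)  # meta
print(t.shape)   # torch.Size([2, 3])

# There is no data to copy off the meta device, so materialize a
# real tensor from the metadata instead (values are uninitialized):
real = torch.empty_like(t, device="cpu")
print(real.device)  # cpu
```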

Initializing lit-llama
Saving to disk at checkpoints/lit-llama/7B700bt
Processing checkpoints/open-llama/7B/pytorch_model-00002-of-00002.bin
Processing checkpoints/open-llama/7B/pytorch_model-00001-of-00002.bin
Verifying...
Loading original model for comparison
Loading checkpoint shards: 100%|██████████| 2/2 [11:32<00:00, 346.27s/it]
Comparing outputs
Traceback (most recent call last):
  File "/home/ubuntu/lit-llama/scripts/convert_hf_checkpoint.py", line 166, in <module>
    CLI(convert_hf_checkpoint)
  File "/opt/conda/envs/litllama/lib/python3.10/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/opt/conda/envs/litllama/lib/python3.10/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/opt/conda/envs/litllama/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/lit-llama/scripts/convert_hf_checkpoint.py", line 160, in convert_hf_checkpoint
    assert torch.testing.assert_close(out, out_hf)
  File "/opt/conda/envs/litllama/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 65535993 / 65536000 (100.0%)
Greatest absolute difference: 17.049419403076172 at index (0, 1, 18853) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 0, 0) (up to 1.3e-06 allowed)

Running the same command with the device set to cpu on another computer led to the same error:

AssertionError: Tensor-likes are not close!

Mismatched elements: 65381392 / 65536000 (99.8%)
Greatest absolute difference: 8.33674693107605 at index (0, 1448, 162) (up to 1e-05 allowed)
Greatest relative difference: 3362.238139053835 at index (0, 1448, 19541) (up to 1.3e-06 allowed)
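As an aside on the failing line itself: `torch.testing.assert_close` returns `None` and raises `AssertionError` on its own when tensors differ, so wrapping it in an outer `assert` (as the script does) would trip even when the comparison succeeds. A small sketch of its behavior with the default float32 tolerances cited in the error message (rtol=1.3e-6, atol=1e-5):

```python
import torch

a = torch.tensor([1.0, 2.0])
b = torch.tensor([1.0, 2.0 + 1e-7])

# Within default tolerances: no exception, returns None.
torch.testing.assert_close(a, b)

# A genuine mismatch raises AssertionError with the
# "Tensor-likes are not close!" message seen above.
try:
    torch.testing.assert_close(a, torch.tensor([1.0, 3.0]))
except AssertionError:
    print("mismatch detected")
```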

I have an isolated env with Python 3.10, containing only the packages from the setup and requirements.txt, plus jsonargparse (which is missing from requirements.txt).

$ pip freeze

aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.7.0
arrow==1.2.3
async-timeout==4.0.2
attrs==23.1.0
beautifulsoup4==4.12.2
bitsandbytes==0.39.0
blessed==1.20.0
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.3
croniter==1.3.15
datasets==2.12.0
dateutils==0.6.12
deepdiff==6.3.0
dill==0.3.6
docstring-parser==0.15
exceptiongroup==1.1.1
fastapi==0.88.0
filelock==3.12.0
frozenlist==1.3.3
fsspec==2023.5.0
h11==0.14.0
huggingface-hub==0.15.1
idna==3.4
importlib-resources==5.12.0
inquirer==3.1.3
itsdangerous==2.1.2
Jinja2==3.1.2
jsonargparse==4.21.1
lightning @ git+https://github.com/Lightning-AI/lightning@1f670a5cbd2bce497b927a94b15138640f9eac03
lightning-cloud==0.5.36
lightning-utilities==0.8.0
lit==16.0.5
-e git+https://github.com/Lightning-AI/lit-llama@713a0b152f5f846f9aee468a879bce22d727bf4a#egg=lit_llama
markdown-it-py==2.2.0
MarkupSafe==2.1.2
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.1
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
ordered-set==4.1.0
packaging==23.1
pandas==2.0.2
psutil==5.9.5
pyarrow==12.0.0
pydantic==1.10.8
Pygments==2.15.1
PyJWT==2.7.0
python-dateutil==2.8.2
python-editor==1.0.4
python-multipart==0.0.6
pytorch-lightning==2.0.2
pytz==2023.3
PyYAML==6.0
readchar==4.0.5
regex==2023.5.5
requests==2.31.0
responses==0.18.0
rich==13.4.1
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
soupsieve==2.4.1
starlette==0.22.0
starsessions==1.3.0
sympy==1.12
tokenizers==0.13.3
torch==2.0.1
torchmetrics==0.11.4
tqdm==4.65.0
traitlets==5.9.0
transformers==4.29.2
triton==2.0.0
typeshed-client==2.3.0
typing_extensions==4.6.3
tzdata==2023.3
urllib3==2.0.2
uvicorn==0.22.0
wcwidth==0.2.6
websocket-client==1.5.2
websockets==11.0.3
xxhash==3.2.0
yarl==1.9.2
zstandard==0.21.0

Please let me know if I am missing something here. Thank you in advance. J

joaopalotti avatar Jun 02 '23 19:06 joaopalotti

Thanks for reporting! I also noticed this in #175, but I think it's safe to ignore.

We should remove this flag and its code in favor of a test comparing against the HF implementation, as we do in Lit-Parrot: https://github.com/Lightning-AI/lit-parrot/blob/main/tests/test_model.py
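A minimal sketch of such a comparison test, using two tiny stand-in modules in place of the real lit-llama and HF models (the `compare_outputs` helper is hypothetical; the actual Lit-Parrot test loads both checkpoint implementations and compares logits):

```python
import torch

def compare_outputs(model_a, model_b, x, rtol=1e-3, atol=1e-3):
    """Assert two models produce numerically close outputs on the same input."""
    with torch.no_grad():
        out_a = model_a(x)
        out_b = model_b(x)
    torch.testing.assert_close(out_a, out_b, rtol=rtol, atol=atol)

# Stand-in "implementations" sharing identical weights:
m1 = torch.nn.Linear(4, 4)
m2 = torch.nn.Linear(4, 4)
m2.load_state_dict(m1.state_dict())
compare_outputs(m1, m2, torch.randn(1, 4))
print("outputs match")
```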

carmocca avatar Jun 02 '23 22:06 carmocca