litgpt Abnormal Output from Gemma Pretrained Model After Conversion to Hugging Face Format

Bug description

I trained the Gemma base model on custom data. After training, I converted the pretrained checkpoint to LitGPT format with this command: litgpt convert_pretrained_checkpoint my_pretrained_checkpoint litgpt_checkpoint

I then tested the model with litgpt chat litgpt_checkpoint. With this command the model works fine and the generation quality is excellent.

Next, I converted the LitGPT checkpoint to a Hugging Face checkpoint with litgpt convert_from_litgpt litgpt_checkpoint hf_checkpoint. This saves a model.pth file in the hf_checkpoint directory. I loaded the .pth file into a Hugging Face model, but when I tested that model the output was random. Here is the code:

import torch
from transformers import AutoTokenizer, Gemma2ForCausalLM, pipeline

# Load the converted weights and pass them to the Hugging Face model
state_dict = torch.load("hf_checkpoint/model.pth")
model = Gemma2ForCausalLM.from_pretrained("google/gemma-2-2b", local_files_only=True, state_dict=state_dict)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

# Generate a continuation for a Bengali prompt
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("আমাদের দেশের"))

The output is - [{'generated_text': 'আমাদের দেশেরinninninninninninninninninninninninninn'}]
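One check that might narrow this down is comparing the parameter names and shapes in the converted checkpoint against what the Hugging Face model expects. The sketch below uses a hypothetical helper, diff_state_dicts, which is not part of litgpt or transformers. It works on plain name-to-shape mappings with toy values, so it runs without loading either model. For real checkpoints, the mappings could be built with {k: tuple(v.shape) for k, v in state_dict.items()}.

```python
from typing import Dict, List, Mapping, Tuple

Shape = Tuple[int, ...]


def diff_state_dicts(
    expected: Mapping[str, Shape], actual: Mapping[str, Shape]
) -> Dict[str, List[str]]:
    """Report keys that are missing, unexpected, or differently shaped."""
    missing = sorted(k for k in expected if k not in actual)
    unexpected = sorted(k for k in actual if k not in expected)
    shape_mismatch = sorted(
        k for k in expected if k in actual and expected[k] != actual[k]
    )
    return {
        "missing": missing,
        "unexpected": unexpected,
        "shape_mismatch": shape_mismatch,
    }


# Toy names and shapes for illustration only; not the real Gemma 2 dimensions.
expected_shapes = {
    "model.embed_tokens.weight": (8, 4),
    "model.layers.0.self_attn.q_proj.weight": (4, 4),
    "model.norm.weight": (4,),
}
converted_shapes = {
    "model.embed_tokens.weight": (8, 4),
    "model.layers.0.self_attn.q_proj.weight": (4, 8),  # wrong shape
    "lm_head.weight": (8, 4),  # not expected by the toy model
}

report = diff_state_dicts(expected_shapes, converted_shapes)
for kind, keys in report.items():
    print(kind, keys)
```

An empty report would not prove the conversion is correct, since the tensor values could still differ, but a non-empty one would point to a naming or layout problem rather than a training problem.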

I'm not sure if I'm missing something. Can anyone help with converting the pretrained checkpoint?

What operating system are you using?

Linux

LitGPT Version

litgpt                   0.4.12
transformers             4.44.2
torch                    2.4.1

Sep 30 '24 08:09 SKNahin

Hi there, do you remember what the output was before the conversion? It would be useful to know to make sure that it was trained well.

Oct 02 '24 12:10 rasbt

> Hi there, do you remember what the output was before the conversion? It would be useful to know to make sure that it was trained well.

I used the following code to test the LitGPT checkpoint.

from litgpt import LLM
llm = LLM.load("litgpt_checkpoint")  # path to the converted LitGPT checkpoint directory
text = llm.generate("আমাদের দেশের")
print(text)

The output was - সবচেয়ে বড় পদ প্রাপ্ত খেলোয়াড়ের নাম কুসুম। তার জন্ম ২০০৪-এর পর। তুখোড় মেধাবী।

I have also generated long texts with the model, for example 4096 new tokens from a single prompt, using the litgpt chat litgpt_checkpoint command. Here is a truncated example.

Prompt: আমাদের দেশের রাজনৈতিক 
>> Reply:  দলগুলি আমাদের দেশের রাজনৈতিক দলের ভবিষ্যতের উন্নতির জন্য কাজ করে যাচ্ছেন। দেশের দীর্ঘমেয়াদী স্থিতিশীলতা নিশ্চিত করতে, আমাদের দেশের আইনপ্রণেতাদের দুর্বার গণ-মিছিলের অধীনে রাজনৈতিক আন্দোলনকে এগিয়ে নিয়ে যেতে হবে।

সিটি মেয়র ও কাউন্সিলরদের নির্বাচন শান্তিপূর্ণ ও উৎসবমুখর পরিবেশে অনুষ্ঠিত হয়েছে। নির্বাচন আনুষ্ঠানিকভাবে শেষ হয় গতকাল সকাল ৮টার দিকে।
চট্টগ্রাম সিটি করপোরেশনের ৪০টি ওয়ার্ডে প্রথমবারের মতো ইলেক্ট্রনিক ভোটিং মেশিনে (ইভিএম) ভোটগ্রহণ করা হয়। দ্বিতীয়বার বিতর্কিত ওয়ার্ড হিসেবে অনুষ্ঠিত হয় ইলেক্ট্রনিক ভোটিং মেশিন (ইভিএম)। মেশিনটির ভোট পড়েছে ৭১ দশমিক ৩৬ ভোট। ভোট পড়া মানুষের মধ্যে ইভিএমে মেশিনের ব্যবহার নিয়ে কৌতূহলের সৃষ্টি হয়েছে।
ভোট দেয়ার সময় ইভিএমের মাধ্যমে প্রথম প্রথম মেশিনে ভোট দেয়া হলেও দ্বিতীয় প্রথম মেশিনের সঙ্গে ব্যবহার করা হয়েছে ইভিএম। এ সময় ভোটারেরা ভোটকেন্দ্রে না গিয়ে নিজ নিজ বাড়ির ভোটকক্ষে ভোট দেন। ভোটকেন্দ্র কর্তৃপক্ষ সংশ্লিষ্টদের সঙ্গে কথা বলে জানা গেছে, ইভিএমের মাধ্যমে অবাধ, সুষ্ঠু ও শান্তিপূর্ণ পরিবেশে ভোটগ্রহণের জন্য নির্বাচন কমিশন ও নির্বাচন কমিশন-সরকারের প্রতি দাবি জানিয়েছে।
ইভিএমে সারাদেশের মতো চট্টগ্রামেও ভোটগ্রহণ অনুষ্ঠিত হয়েছে। আজ চট্টগ্রাম সিটি করপোরেশনের ২৫০টি ওয়ার্ডের প্রথম দফায় ১০৯টি ওয়ার্ডের নতুন ওয়ার্ডগুলোতে ভোটগ্রহণ অনুষ্ঠিত হয়েছে।

Additionally, I ran the Gemma 2 test from tests/test_convert_lit_checkpoint.py locally to check the conversion, using the following code.

import torch
from test_convert_lit_checkpoint import test_against_original_gemma_2
test_against_original_gemma_2("google/gemma-2-2b", "cuda:0", torch.float32)

The test failed on many runs, though not on every run, because the two models' outputs did not match. For example, I got:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 3
      1 import torch
      2 from test_convert_lit_checkpoint import test_against_original_gemma_2
----> 3 test_against_original_gemma_2("google/gemma-2-2b", "cuda:0", torch.float32)

File /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File /workspace/nemo/llm-pretraining/test_convert_lit_checkpoint.py:464, in test_against_original_gemma_2(model_name, device, dtype)
    462 print(ours_y, theirs_y)
    463 print(torch.sum(ours_y-theirs_y))
--> 464 torch.testing.assert_close(ours_y, theirs_y, rtol=3e-5, atol=3e-5)

File /usr/local/lib/python3.10/dist-packages/torch/testing/_comparison.py:1524, in assert_close(actual, expected, allow_subclasses, rtol, atol, equal_nan, check_device, check_dtype, check_layout, check_stride, msg)
   1502 error_metas = not_close_error_metas(
   1503     actual,
   1504     expected,
   (...)
   1519     msg=msg,
   1520 )
   1522 if error_metas:
   1523     # TODO: compose all metas into one AssertionError
-> 1524     raise error_metas[0].to_error(msg)

AssertionError: Tensor-likes are not close!

Mismatched elements: 1814841 / 5120000 (35.4%)
Greatest absolute difference: 0.015489339828491211 at index (0, 12, 207875) (up to 3e-05 allowed)
Greatest relative difference: 412.79998779296875 at index (0, 16, 63337) (up to 3e-05 allowed)
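To help read the numbers in this error: torch.testing.assert_close documents its elementwise rule as |actual - expected| <= atol + rtol * |expected|. The sketch below reimplements that rule in plain Python for scalars. The value pairs are hypothetical, chosen only to resemble the magnitudes reported above, and are not taken from the actual run. It also shows why a very large relative difference can appear whenever the expected value is close to zero, which makes the absolute difference the more informative figure here.

```python
def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Scalar version of the closeness rule documented for torch.testing.assert_close."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


rtol = atol = 3e-5  # tolerances used in the failing test

# Hypothetical (ours, theirs) pairs; not values from the actual run.
pairs = [
    (1.000010, 1.000000),  # tiny difference, within tolerance
    (2.515489, 2.500000),  # absolute difference of about 0.0155
    (0.004138, 0.000010),  # expected value near zero inflates the relative difference
]

results = []
for ours, theirs in pairs:
    abs_diff = abs(ours - theirs)
    rel_diff = abs_diff / abs(theirs)
    close = is_close(ours, theirs, rtol, atol)
    results.append(close)
    print(f"abs={abs_diff:.6f} rel={rel_diff:.2f} close={close}")
```

With tolerances of 3e-5, an absolute difference near 0.015 is a few hundred times larger than allowed, so a 35.4% mismatch rate looks like more than float32 rounding noise, although that reading is an inference and not something the test itself confirms.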

Oct 02 '24 20:10 SKNahin