RoPE precision issue

Open carmocca opened this issue 2 years ago • 3 comments

One of the CUDA tests is failing: pytest tests/test_model.py::test_bfloat16_llama_init

E       RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::BFloat16 instead.

I think there's a bug in how the dtype is managed in RoPE.

Originally posted by @carmocca in https://github.com/Lightning-AI/lit-stablelm/pull/11#discussion_r1186292614

carmocca, May 05 '23 17:05

Adding a cast back to the input dtype:

    roped = (x * cos) + (rotated * sin)
    return roped.type_as(x)

fixes the dtype error above, but the test still fails, this time on numerical tolerance. Generation looks fine, though.
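For anyone following along, here is a minimal sketch of why the cast matters. It assumes the cos/sin cache is float32 while the activations are bfloat16, which would match the error message but is not confirmed here. `apply_rope_sketch` and its `cast_back` flag are hypothetical names for illustration, not the repository's code.

```python
import torch


def apply_rope_sketch(x, cos, sin, cast_back):
    # Hypothetical stand-in for the RoPE application (illustration only).
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)
    # bfloat16 * float32 is promoted to float32 by PyTorch's type promotion.
    roped = (x * cos) + (rotated * sin)
    # The proposed fix: cast the result back to the dtype of the input.
    return roped.type_as(x) if cast_back else roped


T, head_size = 4, 8
x = torch.randn(1, 2, T, head_size).to(torch.bfloat16)

# Assumption: the rotary cache is built in float32.
pos = torch.arange(T, dtype=torch.float32)
theta = 1.0 / (10000 ** (torch.arange(0, head_size, 2, dtype=torch.float32) / head_size))
angles = torch.outer(pos, theta)              # (T, head_size / 2)
angles = torch.cat((angles, angles), dim=-1)  # (T, head_size)
cos, sin = angles.cos(), angles.sin()

promoted = apply_rope_sketch(x, cos, sin, cast_back=False)
fixed = apply_rope_sketch(x, cos, sin, cast_back=True)
print(promoted.dtype, fixed.dtype)
```

Without the cast, the roped query and key come out as float32 while the value stays bfloat16, which is the mismatch the attention call complains about.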

pytest tests/test_model.py::test_model_bfloat16 -s

    @pytest.mark.skipif(not torch.cuda.is_available(), reason="Requires CUDA")
    @torch.no_grad()
    def test_model_bfloat16(lit_stablelm) -> None:
        from lit_stablelm.utils import EmptyInitOnDevice
    
        block_size = 64
        vocab_size = 32000
        n_layer = 16
        n_head = 16
        n_embd = 32
    
        config = lit_stablelm.StableLMConfig(
            block_size=block_size, vocab_size=vocab_size, n_layer=n_layer, n_head=n_head, n_embd=n_embd
        )
        model = lit_stablelm.StableLM(config)
        model.apply(model._init_weights)
    
        batch_size = 3
        token_sample = torch.randint(0, vocab_size, size=(batch_size, block_size), dtype=torch.int64)
    
        expected = model(token_sample)
    
        with EmptyInitOnDevice(device="cuda", dtype=torch.bfloat16):
            model2 = lit_stablelm.StableLM(config)
        model2.load_state_dict(model.state_dict(keep_vars=True))
    
        out = model2(token_sample.cuda()).float().cpu()
>       torch.testing.assert_close(out, expected, atol=5e-3, rtol=1e-3)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 8281 / 6193152 (0.1%)
E       Greatest absolute difference: 0.010229339823126793 at index (0, 54, 3100) (up to 0.005 allowed)
E       Greatest relative difference: 289.2816162109375 at index (2, 3, 27284) (up to 0.001 allowed)

tests/test_model.py:106: AssertionError
======================================================================================================== short test summary info ========================================================================================================
FAILED tests/test_model.py::test_model_bfloat16 - AssertionError: Tensor-likes are not close!

carmocca, May 05 '23 17:05

cc @t-vi, maybe you can catch this bug easily, as you wrote this test originally.

carmocca, May 05 '23 17:05

I'm not entirely sure. It could be that the tolerance is somewhat tight, but I have not checked in detail what error I'd expect for bf16.

t-vi, May 05 '23 19:05
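As a rough sanity check on the tolerance question, the sketch below emulates bfloat16 rounding with the standard library and applies the closeness rule documented for `torch.testing.assert_close`, `|actual - expected| <= atol + rtol * |expected|`. The helpers `to_bfloat16` and `within_tolerance` are hypothetical names, and the sample value 5.015 is arbitrary. It only shows that a single bf16 rounding step can exceed `atol=5e-3, rtol=1e-3`; it does not establish that this is why the test fails.

```python
import struct


def to_bfloat16(value: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 keeps the top 16 bits of the float32 encoding.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def within_tolerance(actual: float, expected: float, atol: float, rtol: float) -> bool:
    # Closeness rule documented for torch.testing.assert_close.
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Adjacent bfloat16 values just above 1.0 are 2**-7 apart, so one rounding
# step can introduce a relative error of up to 2**-8 (about 0.0039).
spacing_at_one = to_bfloat16(1.0 + 2**-7) - 1.0

expected = 5.015                # arbitrary sample value
actual = to_bfloat16(expected)  # what bf16 can represent nearby
print(spacing_at_one, actual, within_tolerance(actual, expected, atol=5e-3, rtol=1e-3))
```

A model with 16 layers accumulates many such roundings, so some elements drifting past these tolerances would not be surprising on its own.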