
[BUG] [Shardformer]: Error in blip2 testing with half precision

Open · insujang opened this issue 10 months ago · 1 comment

🐛 Describe the bug

  1. The blip2 test doesn't seem to work at all when the model is half precision (torch.float16).
  2. With bfloat16, colossalai.shardformer.layer.FusedLayerNorm doesn't seem to work correctly.

This test file passes as is: https://github.com/hpcaitech/ColossalAI/blob/main/tests/test_shardformer/test_model/test_shard_blip2.py

But if I change the dtype to torch.float16 here: https://github.com/hpcaitech/ColossalAI/blob/89049b0d899477a3b31f02b31fde1a839e31c6fc/tests/test_shardformer/test_model/test_shard_blip2.py#L92

It fails:

E         File "test_shard_blip2.py", line 28, in check_forward_backward
E           assert_hf_output_close(org_output, shard_output, ignore_keys=["past_key_values"])
E         File "colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E           assert_hf_output_close(
E         File "colossalai/testing/comparison.py", line 149, in assert_hf_output_close
E           assert_close(
E         File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
E           raise error_metas[0].to_error(msg)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 5947392 / 5947392 (100.0%)
E       Greatest absolute difference: nan at index (0, 0) (up to 1e-06 allowed)
E       Greatest relative difference: nan at index (0, 0) (up to 1e-05 allowed)
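The 100% mismatch with nan as the greatest difference suggests that some intermediate value overflows float16's range (max ≈ 65504) during the forward pass, after which nan propagates to every output element. A minimal sketch of that failure mode, using numpy scalars purely for illustration (this is my guess at the mechanism, not the actual failing code path):

```python
import numpy as np

# float16 saturates around 65504; anything larger overflows to inf
x = np.float16(60000.0)
doubled = x * np.float16(2.0)   # overflows to inf

# inf - inf yields nan, which then propagates through subsequent ops
diff = doubled - doubled

# once a nan appears, torch.testing.assert_close-style comparisons
# report nan as the greatest absolute/relative difference
print(np.isinf(doubled), np.isnan(diff))
```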

With dtype=torch.bfloat16 and without enable_fused_normalization it passes, but if I enable enable_fused_normalization, it fails again:

E         File "test_shard_blip2.py", line 28, in check_forward_backward
E           assert_hf_output_close(org_output, shard_output, ignore_keys=["past_key_values"])
E         File "/colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E           assert_hf_output_close(
E         File "/colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E           assert_hf_output_close(
E         File "/colossalai/testing/comparison.py", line 149, in assert_hf_output_close
E           assert_close(
E         File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
E           raise error_metas[0].to_error(msg)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 24271 / 2161696 (1.1%)
E       Greatest absolute difference: 0.0078125 at index (0, 3, 47) (up to 1e-05 allowed)
E       Greatest relative difference: 169.0 at index (0, 3, 47325) (up to 1e-05 allowed)
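If I read the numbers right, the reported greatest absolute difference of 0.0078125 is exactly 2**-7, which is one unit in the last place of bfloat16 for values in [1, 2) (bfloat16 stores 7 explicit significand bits). So the fused and unfused LayerNorm kernels may differ only by single-ulp rounding, which a 1e-5 tolerance cannot absorb at that precision. A quick arithmetic check of that reading:

```python
# bfloat16 keeps 7 stored significand bits, so the spacing between
# adjacent representable values in [1, 2) is 2**-7
bf16_ulp_at_one = 2.0 ** -7

# the greatest absolute difference reported by the failing test
reported_abs_diff = 0.0078125

# the reported difference matches one bf16 ulp exactly, and is far
# larger than the 1e-5 absolute tolerance used by the comparison
print(reported_abs_diff == bf16_ulp_at_one)  # True
print(bf16_ulp_at_one > 1e-5)                # True
```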

Environment

torch 2.2.1 / CUDA 12.1 / colossalai 0.3.6 / transformers 4.36.0

insujang avatar Apr 15 '24 20:04 insujang

I am not sure whether this is a bug or an unavoidable error due to lower precision, i.e. whether the test was intended to run only in fp32. I would appreciate any insights you could share. Thanks.

insujang avatar Apr 16 '24 20:04 insujang