[BUG] [Shardformer]: Error in blip2 testing with half precision
🐛 Describe the bug
- With half precision (torch.float16), the blip2 test does not seem to work at all: the sharded outputs are entirely NaN.
- With torch.bfloat16, colossalai.shardformer.layer.FusedLayerNorm does not seem to produce outputs that match the unfused path.
The test file https://github.com/hpcaitech/ColossalAI/blob/main/tests/test_shardformer/test_model/test_shard_blip2.py passes as it is. But if I change the dtype to torch.float16 here:
https://github.com/hpcaitech/ColossalAI/blob/89049b0d899477a3b31f02b31fde1a839e31c6fc/tests/test_shardformer/test_model/test_shard_blip2.py#L92
it fails:
E File "test_shard_blip2.py", line 28, in check_forward_backward
E assert_hf_output_close(org_output, shard_output, ignore_keys=["past_key_values"])
E File "colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E assert_hf_output_close(
E File "colossalai/testing/comparison.py", line 149, in assert_hf_output_close
E assert_close(
E File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
E raise error_metas[0].to_error(msg)
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 5947392 / 5947392 (100.0%)
E Greatest absolute difference: nan at index (0, 0) (up to 1e-06 allowed)
E Greatest relative difference: nan at index (0, 0) (up to 1e-05 allowed)
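Since the mismatch is reported as NaN, the sharded fp16 forward pass apparently produces NaNs somewhere. For debugging, here is a rough sketch of how one could locate the first submodule that emits non-finite values using forward hooks; this is plain PyTorch and nothing ColossalAI-specific, and `sharded_model` / `inputs` are placeholders for whatever the test builds:

```python
import torch
import torch.nn as nn

def register_nan_hooks(model: nn.Module):
    """Attach forward hooks that report any submodule producing non-finite outputs.

    The first line printed is the earliest offender in execution order.
    """
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output in: {name} ({type(module).__name__})")
        return hook
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

# Hypothetical usage with the objects the test creates:
# handles = register_nan_hooks(sharded_model)
# sharded_model(**inputs)        # run the same forward pass as the test
# for h in handles:
#     h.remove()
```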
With dtype=torch.bfloat16 and without enable_fused_normalization it passes, but with enable_fused_normalization enabled it fails again:
E File "test_shard_blip2.py", line 28, in check_forward_backward
E assert_hf_output_close(org_output, shard_output, ignore_keys=["past_key_values"])
E File "/colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E assert_hf_output_close(
E File "/colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E assert_hf_output_close(
E File "/colossalai/testing/comparison.py", line 149, in assert_hf_output_close
E assert_close(
E File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
E raise error_metas[0].to_error(msg)
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 24271 / 2161696 (1.1%)
E Greatest absolute difference: 0.0078125 at index (0, 3, 47) (up to 1e-05 allowed)
E Greatest relative difference: 169.0 at index (0, 3, 47325) (up to 1e-05 allowed)
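One observation: the greatest absolute difference here, 0.0078125, is exactly 2**-7, i.e. one bfloat16 ulp for values near 1.0 (and the large relative difference presumably comes from an element close to zero). So this may just be the fused kernel rounding the last bit differently, e.g. by accumulating in fp32 internally. Below is a minimal sketch using plain PyTorch only (not the actual FusedLayerNorm) that shows differences of this size between a bf16-native LayerNorm and an fp32-accumulated reference:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 197, 768)       # arbitrary shape, just for illustration
ln = torch.nn.LayerNorm(768)

out_ref  = ln(x).to(torch.bfloat16)                       # fp32 math, result cast to bf16
out_bf16 = ln.to(torch.bfloat16)(x.to(torch.bfloat16))    # bf16 end to end

diff = (out_bf16.float() - out_ref.float()).abs()
print("max abs diff :", diff.max().item())                # typically on the order of 2**-7
print("bf16 epsilon :", torch.finfo(torch.bfloat16).eps)  # 0.0078125
```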
Environment
torch 2.2.1 / CUDA 12.1 / colossalai 0.3.6 / transformers 4.36.0
I am not sure whether this is a bug or an unavoidable error due to lower precision, i.e. whether the test was intended to run only in fp32. I would appreciate any insight you could share. Thanks.
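If half precision is meant to be supported in this test, one direction I could imagine (purely a sketch on my side, not something the repo currently does) would be to pass dtype-dependent tolerances to assert_close instead of the fixed fp32-level ones, roughly in the spirit of torch's own reduced-precision defaults:

```python
import torch
from torch.testing import assert_close

# Hypothetical tolerance table; the fp16/bf16 values are my assumption, not from the repo.
_TOLERANCES = {
    torch.float32:  dict(rtol=1e-5,   atol=1e-6),
    torch.float16:  dict(rtol=1e-3,   atol=1e-4),
    torch.bfloat16: dict(rtol=1.6e-2, atol=1e-4),
}

def assert_close_by_dtype(actual: torch.Tensor, expected: torch.Tensor):
    """Compare tensors with tolerances picked from their dtype (sketch of a possible helper)."""
    tol = _TOLERANCES.get(actual.dtype, dict(rtol=1e-5, atol=1e-6))
    assert_close(actual, expected, **tol)
```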