
[BUG]: Zero returns fp16 tensors which causes RuntimeError

Open yuxinyuan opened this issue 3 years ago • 3 comments

🐛 Describe the bug

I ran the following script and it reports Found dtype Float but expected Half. It turns out that y_hat is fp16, but y and the loss are fp32, which causes the error.

Maybe ZeroModel should cast the output back to fp32 instead of leaving it to the user?


import colossalai
import colossalai.nn
import colossalai.utils
import colossalai.zero.init_ctx
import colossalai.zero.shard_utils
import torch
import torch.utils.data
from colossalai.core import global_context as colossal_gpc

IN_DIM, OUT_DIM = 4096, 4096


class MyDS(torch.utils.data.Dataset):
    def __len__(self):
        return 10000

    def __getitem__(self, index):
        return torch.randn(IN_DIM), torch.randn(OUT_DIM)


class MyModel(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Sequential(
            torch.nn.Linear(IN_DIM, 32768),
            torch.nn.Linear(32768, OUT_DIM),
        )

    def forward(self, x):
        return self.linear(x)


def main():
    config = {
        "zero": {
            "model_config": {
                "shard_strategy": colossalai.zero.shard_utils.BucketTensorShardStrategy(),
                "reduce_scatter_bucket_size_mb": 25,
                "fp32_reduce_scatter": False,
                "tensor_placement_policy": "cuda",
                "gradient_predivide_factor": 1.0,
                "reuse_fp16_shard": True,
            },
            "optimizer_config": {
                "gpu_margin_mem_ratio": 0.5,
                "initial_scale": 2**5,
                "min_scale": 1,
                "growth_factor": 2,
                "backoff_factor": 0.5,
                "growth_interval": 1000,
                "hysteresis": 2,
                "max_scale": 2**32,
            },
        },
    }
    colossalai.launch_from_torch(config)

    ctx = colossalai.zero.init_ctx.ZeroInitContext(
        target_device=colossalai.utils.get_current_device(),
        shard_strategy=colossal_gpc.config.zero.model_config.shard_strategy,
        shard_param=True,
    )

    with ctx:
        model = MyModel()

    ds = MyDS()
    loader = colossalai.utils.get_dataloader(ds, batch_size=128)

    optim = colossalai.nn.HybridAdam(model.parameters())
    criterion = torch.nn.MSELoss()

    engine, loader, _, _ = colossalai.initialize(model, optim, criterion, loader)

    engine.train()
    for i, (x, y) in enumerate(loader):
        x, y = x.cuda(), y.cuda()

        y_hat = engine(x)
        # y_hat = y_hat.float()

        print(f"{y_hat.dtype}, {y.dtype}")
        loss = engine.criterion(y_hat, y)
        print(f"Iteration {i}, loss: {loss.item()}.")

        engine.backward(loss)
        engine.step()
        engine.zero_grad()


if __name__ == "__main__":
    main()

Environment

colossalai version 0.1.5+torch1.11cu10.2

yuxinyuan avatar Jun 09 '22 07:06 yuxinyuan
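As an aside on the suggestion above that the output could be cast back to fp32 automatically: a user-side version of that idea is a thin wrapper module like the sketch below. This is a hypothetical helper, not a ColossalAI API, and whether wrapping a ZeRO-initialized model this way interacts cleanly with the engine is not verified here.

import torch


class CastOutputToFloat(torch.nn.Module):
    """Hypothetical wrapper: run the wrapped module, then cast fp16 outputs to fp32."""

    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module

    def forward(self, *args, **kwargs):
        out = self.module(*args, **kwargs)
        # Only cast floating-point tensor outputs; leave everything else untouched.
        if torch.is_tensor(out) and out.is_floating_point():
            return out.float()
        return out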

For some loss functions, like cross entropy loss, fp16 output is OK. Casting the output back to fp32 may increase memory usage during backward, and loss less precision. You can do this yourself for now, and we are trying to figure out whether casting the output back to fp32 is a good idea.

ver217 avatar Jun 10 '22 10:06 ver217
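Concretely, the manual cast fits into the training loop of the reproduction script above as follows (this reuses engine, x, and y from that script; the commented-out line is an untested variation that keeps everything in fp16 instead):

y_hat = engine(x)                          # the ZeRO-wrapped model returns an fp16 tensor
loss = engine.criterion(y_hat.float(), y)  # cast to fp32 so y_hat, y, and the loss share a dtype

# Untested alternative: down-cast the target and keep the loss in fp16.
# loss = engine.criterion(y_hat, y.half())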

I understand that fp32 would increase memory footprint, but I don't understand why it would be less precise.

yuxinyuan avatar Jun 10 '22 11:06 yuxinyuan

> I understand that fp32 would increase memory footprint, but I don't understand why it would be less precise.

By "loss less precision" I meant more precise, not less.

ver217 avatar Jun 10 '22 11:06 ver217
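To unpack that exchange: casting an fp16 output up to fp32 is exact, since every fp16 value is representable in fp32, so the upcast itself loses nothing; it is computing in fp16 that carries less precision, and the extra fp32 copy that costs memory. A small plain-PyTorch illustration, independent of ColossalAI:

import torch

# Upcasting fp16 -> fp32 is exact: the round trip returns the original values.
x = torch.randn(1024).half()
assert torch.equal(x.float().half(), x)

# Downcasting fp32 -> fp16 is not: fp16 has a 10-bit mantissa, so values near
# 1.0 are spaced about 0.001 apart and smaller differences are rounded away.
a = torch.tensor(1.0001)
print(a)         # tensor(1.0001)
print(a.half())  # tensor(1., dtype=torch.float16)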

We have made many updates since then. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 04:04 binmakeswell