[BUG] my autocast is not working
Describe the bug
I'm working on ds_train.py in https://github.com/YooSungHyun/pytorch-trainer.
When I enable fp16 in the DeepSpeed config, the model weights are fp16 but the input data is still fp32.
I understood that auto_cast should take care of this, but the forward pass raises the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1836, in forward
loss = self.module(*inputs, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/networks/models.py", line 13, in forward
hidden, _ = self.lstm1(inputs)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 879, in forward
result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: Input and parameter tensors are not the same dtype, found input tensor with Float and parameter tensor with Half
What did I do wrong?
To Reproduce
Steps to reproduce the behavior:
- Run scripts/run_train_deepspeed.sh
- The error above is raised
Expected behavior
The forward pass should complete without a dtype error.
Additional context
My ZeRO stage 1 config looks like this:
{
  "fp16": {
    "enabled": true,
    "auto_cast": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e7,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e7,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "wall_clock_breakdown": false
}
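To double-check which precision a config like the one above asks for, a small stdlib-only sketch can read the JSON and report the dtype the weights are expected to end up in. The helper names `expected_weight_dtype` and `relies_on_auto_cast` are hypothetical, and the config text is a trimmed copy of the one above. This only inspects the file; it does not reproduce what DeepSpeed does internally.

```python
import json

# Trimmed copy of the config above; only the keys inspected below are kept.
CONFIG_TEXT = """
{
  "fp16": {"enabled": true, "auto_cast": true, "loss_scale": 0},
  "zero_optimization": {"stage": 1},
  "train_micro_batch_size_per_gpu": "auto"
}
"""


def expected_weight_dtype(config: dict) -> str:
    """Hypothetical helper: name of the dtype the config requests for the weights."""
    if config.get("fp16", {}).get("enabled", False):
        return "float16"
    if config.get("bf16", {}).get("enabled", False):
        return "bfloat16"
    return "float32"


def relies_on_auto_cast(config: dict) -> bool:
    """Hypothetical helper: True if the fp16 section sets auto_cast."""
    return bool(config.get("fp16", {}).get("auto_cast", False))


config = json.loads(CONFIG_TEXT)
weight_dtype = expected_weight_dtype(config)
uses_auto_cast = relies_on_auto_cast(config)
print(weight_dtype, uses_auto_cast)
```

If the reported dtype is float16, any float32 tensor that reaches the model without being converted will trigger the Float/Half mismatch shown in the traceback.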
Maybe that option only works together with deepspeed.initialize(training_data=...)?
I am not passing my dataset to deepspeed.initialize. I'm using torch.utils.data.Dataset and torch's DataLoader, not the DeepSpeed wrapper.
I passed the arguments to the model as model(**batch), but DeepSpeed's auto_cast only works on positional arguments (*args).
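If auto_cast does skip keyword arguments as described, one way to keep the model(**batch) call is to cast the floating-point tensors yourself before the forward pass. The sketch below is a minimal illustration, not DeepSpeed's own mechanism. `cast_floating_tensors` is a hypothetical helper, and the batch keys and shapes are made up for the example.

```python
import torch


def cast_floating_tensors(batch, dtype=torch.float16):
    """Return a copy of `batch` with floating-point tensors cast to `dtype`.

    Integer tensors (e.g. class labels) and non-tensor values are left as-is.
    """
    return {
        key: value.to(dtype)
        if torch.is_tensor(value) and value.is_floating_point()
        else value
        for key, value in batch.items()
    }


# Made-up batch: float32 features plus integer labels.
batch = {"inputs": torch.randn(2, 5, 8), "labels": torch.tensor([0, 1])}
half_batch = cast_floating_tensors(batch)

# The forward call would then be, with `model` being the DeepSpeed engine:
# output = model(**half_batch)
```

The original batch is left untouched, so the float32 tensors remain available if the loss needs them.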
Replacing model(**batch) with model(batch["inputs"]) fixed the forward pass for me, but now I get an error in backward(). I'm also using a torch optimizer as the optimizer.
Found dtype Float but expected Half
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2019, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1958, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/ds_train.py", line 115, in training_step
model.backward(loss)
File "/data/bart/temp_workspace/pytorch-trainer/trainer/deepspeed.py", line 233, in train_loop
loss = self.training_step(model=model, batch=batch, batch_idx=batch_idx)
File "/data/bart/temp_workspace/pytorch-trainer/trainer/deepspeed.py", line 155, in fit
self.train_loop(
File "/data/bart/temp_workspace/pytorch-trainer/ds_train.py", line 583, in main
trainer.fit(
File "/data/bart/temp_workspace/pytorch-trainer/ds_train.py", line 606, in <module>
main(args)
RuntimeError: Found dtype Float but expected Half
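One possible cause of this backward error is a loss that mixes the fp16 model output with fp32 labels. This is an assumption, because the issue does not show the criterion. A minimal sketch of a workaround under that assumption is to upcast both operands to float32 before computing the loss. MSE is used here purely for illustration, and `loss_in_fp32` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F


def loss_in_fp32(output, labels):
    """Compute an MSE loss in float32, whatever the dtypes of the operands.

    MSE is an assumed stand-in for the real criterion.
    """
    return F.mse_loss(output.float(), labels.float())


# Stand-ins for a half-precision model output and float32 regression targets.
output = torch.randn(4, 3).half().requires_grad_()
labels = torch.randn(4, 3)

loss = loss_in_fp32(output, labels)
loss.backward()  # the gradient is cast back to the dtype of `output`
```

Because the upcast is part of the autograd graph, the gradient that flows back into the half-precision tensor is fp16 again, so the dtypes on both sides of backward() agree.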
In place of auto_cast I'm now using torch.cuda.amp autocast, which I'm sure works, but will it cause any problems when using offload and similar features?
with autocast(enabled=True, dtype=torch.float16):
    labels = batch.pop("labels")
    output = model(batch["inputs"])
    loss = self.criterion(output, labels)
Same issue here. I don't know whether torch.autocast can be used together with DeepSpeed fp16.