torch.autocast and DeepSpeed don't seem to compose directly; running the example raises an error
Describe the bug
Training examples/aishell/whisper with DeepSpeed fails with:
```
python3.10/site-packages/deepspeed/runtime/torch_autocast.py", line 97, in validate_nested_autocast
    raise AssertionError(
AssertionError: torch.autocast is enabled outside DeepSpeed, but not in the DeepSpeed config. Please enable torch.autocast through the DeepSpeed config to ensure the correct communication dtype is used.
```
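The assertion text itself points at one possible fix: recent DeepSpeed releases integrate torch.autocast natively, switched on through the config rather than by wrapping the forward pass yourself. A hedged sketch of the section the message refers to (key names follow DeepSpeed's mixed-precision docs from memory; verify them against 0.17.5, and note the docs suggest using this mode instead of, not together with, the fp16/bf16 sections):

```json
{
  "torch_autocast": {
    "enabled": true,
    "dtype": "bfloat16"
  }
}
```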
Removing the `with autocast` block from the batch_forward function lets training start, but then a dtype mismatch appears:

```
ch/nn/modules/conv.py", line 370, in _conv_forward
    return F.conv1d(
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
```
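This mismatch is expected once the autocast wrapper is gone: with `"bf16": {"enabled": true}` DeepSpeed casts the module weights to bfloat16, while the features are still fed in as fp32. One hedged workaround is to cast floating-point inputs to the model's parameter dtype before the forward pass; `cast_to_model_dtype` below is a hypothetical helper, not WeNet API:

```python
import torch

def cast_to_model_dtype(model: torch.nn.Module, feats: torch.Tensor) -> torch.Tensor:
    """Cast floating-point inputs to the model's parameter dtype so ops like
    F.conv1d see matching input/bias dtypes (bfloat16 under DeepSpeed bf16)."""
    param_dtype = next(model.parameters()).dtype
    return feats.to(param_dtype) if feats.is_floating_point() else feats
```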
Manually casting the inputs to bf16 then moves the failure into the loss computation:

```
RuntimeError: "ctc_loss_cuda" not implemented for 'BFloat16'
```
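ctc_loss has no BFloat16 CUDA kernel, so a common workaround is to upcast just that op to fp32. A minimal sketch, assuming the blank id and reduction below rather than taking them from the WeNet code:

```python
import torch
import torch.nn.functional as F

def ctc_loss_fp32(log_probs: torch.Tensor, targets: torch.Tensor,
                  input_lengths: torch.Tensor,
                  target_lengths: torch.Tensor) -> torch.Tensor:
    """Compute CTC loss in float32 even when log_probs arrive as bfloat16.

    log_probs: (T, N, C) log-softmax outputs.
    """
    return F.ctc_loss(
        log_probs.float(),   # upcast only for the loss kernel
        targets,
        input_lengths,
        target_lengths,
        blank=0,             # assumption: blank id 0
        reduction="sum",     # assumption: sum reduction
        zero_infinity=True,
    )
```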
torch: 2.6.0, deepspeed: 0.17.5
What is your deepspeed config?
The conf/ds_stage1.json that ships with the code:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 100,
  "gradient_clipping": 5,
  "fp16": {
    "enabled": false,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "consecutive_hysteresis": false,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": true
  },
  "zero_force_ds_cpu_optimizer": false,
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  }
}
```
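Note that this config enables bf16 but declares no torch_autocast section, which is exactly the combination the first assertion complains about once batch_forward wraps the forward in torch.autocast; the config sketch earlier in the thread shows the section the error message asks for.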
Did you ever solve this? I'm running into the same problem.