apex
Regression: tensor_parallel.ColumnParallelLinear fails on onnx.export
Describe the Bug
This happens while exporting one of the NeMo Megatron modules that use tensor_parallel.ColumnParallelLinear, with ToT. It used to work with previous releases. Apparently, the problem is that the inference/no-grad forward execution path still goes through the LinearWithGradAccumulationAndAsyncAllreduce autograd Function's forward() - which by design won't export.
E0408 21:46:43.917169 140336425469760 export.py:160] Export failed. Please make sure your NeMo model class (nemo.collections.nlp.models.question_answering.qa_model.QAModel) has working export() and that you have the latest NeMo package installed with [all] dependencies.
Traceback (most recent call last):
File "/git/NeMo/scripts/export.py", line 176, in
Defined at: /opt/conda/lib/python3.8/site-packages/apex/transformer/tensor_parallel/layers.py(315): linear_with_grad_accumulation_and_async_allreduce
Expected Behavior
Environment
This was my quick workaround: replace instances of tensor_parallel.ColumnParallelLinear with the wrapper class below. Something like this should be implemented inside tensor_parallel.ColumnParallelLinear.forward instead:
import torch
from apex.transformer import tensor_parallel
from apex.transformer.parallel_state import get_tensor_model_parallel_world_size

class ColumnLinear(tensor_parallel.ColumnParallelLinear):
    # Redefine forward only for non-parallel (world_size == 1) inference;
    # training and tensor-parallel execution fall back to the base class.
    def forward(self, input_):
        world_size = get_tensor_model_parallel_world_size()
        if input_.requires_grad or world_size > 1:
            return tensor_parallel.ColumnParallelLinear.forward(self, input_)
        bias = self.bias if not self.skip_bias_add else None
        # Matrix multiply with plain tensor ops, which the ONNX exporter can trace.
        output = torch.matmul(input_, self.weight.t())
        if bias is not None:
            output = output + bias
        output_bias = self.bias if self.skip_bias_add else None
        return output, output_bias
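To illustrate the underlying pattern outside of apex, here is a minimal sketch in plain PyTorch (the names LinearFn and MyLinear are hypothetical stand-ins, not apex APIs): a custom torch.autograd.Function handles the training path, while no-grad inference routes through ordinary tensor ops that torch.onnx.export can trace.

```python
import torch

class LinearFn(torch.autograd.Function):
    """Toy stand-in for LinearWithGradAccumulationAndAsyncAllreduce."""
    @staticmethod
    def forward(ctx, input_, weight):
        ctx.save_for_backward(input_, weight)
        return input_.matmul(weight.t())

    @staticmethod
    def backward(ctx, grad_output):
        input_, weight = ctx.saved_tensors
        return grad_output.matmul(weight), grad_output.t().matmul(input_)

class MyLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, input_):
        if input_.requires_grad:
            # Training path: custom autograd Function (not ONNX-exportable).
            return LinearFn.apply(input_, self.weight)
        # Inference path: plain ops the ONNX exporter can trace.
        return torch.matmul(input_, self.weight.t())
```

The branch on input_.requires_grad mirrors the wrapper above: exporting under torch.no_grad() never reaches the autograd Function, so tracing succeeds.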
This seems related to https://github.com/NVIDIA/NeMo/pull/3998