sagemaker-debugger
Loss Tensors Are Saved Twice On AWS Pytorch
Loss values computed through torch.nn.functional are saved twice for each step with AWS PyTorch.
This happens because functional losses are already saved by default by the post_hook_for_loss_functional function in AWS PyTorch.
...
for _ in range(n_steps):
    optimizer.zero_grad()
    outputs = net(inputs)
    # First save: the post_hook_for_loss_functional registered on F.cross_entropy
    # records this loss value automatically.
    loss = F.cross_entropy(outputs, labels)
    # Second save: the same loss value is recorded again manually.
    hook.record_tensor_value("nll_loss", tensor_value=loss)
    loss.backward()
    optimizer.step()
...
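One way to confirm the duplication is to inspect what was written with smdebug's trial API. This is a minimal sketch; the output path ./smdebug_out is an assumed value for the hook's out_dir, and the exact tensor names depend on the smdebug version:

from smdebug.trials import create_trial

# Load whatever the hook wrote during training (the path is an assumption).
trial = create_trial("./smdebug_out")

# Expect to see both the automatically saved functional loss and "nll_loss".
print(trial.tensor_names())
for name in trial.tensor_names():
    # Each name lists the steps at which a value was recorded.
    print(name, trial.tensor(name).steps())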
post_hook_for_loss_functional is invoked by the call to F.cross_entropy(outputs, labels), so the loss is saved once by the hook and a second time by record_tensor_value.
This happens when, as in the test https://github.com/awslabs/sagemaker-debugger/blob/master/tests/pytorch/test_loss.py#L61, a user runs with AWS PyTorch but also modifies the training script to save the loss manually.
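If the goal is to record the loss only once, a simple workaround under these assumptions is to drop the manual record_tensor_value call and rely on the default functional-loss saving (sketch only; net, inputs, labels, optimizer, and hook are set up as in the snippet above):

for _ in range(n_steps):
    optimizer.zero_grad()
    outputs = net(inputs)
    # Saved once by the default post_hook_for_loss_functional.
    loss = F.cross_entropy(outputs, labels)
    loss.backward()
    optimizer.step()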