PiPPy
[spmd] self-attention not converging
What the problem is:
Both the single-node and the sharded TensorParallelMultiheadAttention (#477) modules diverge: the forward output becomes `-inf` in fewer than 10 iterations. They also produce slightly different forward outputs, and the relative difference is small enough that `self.assertEqual` does not report them as unequal.
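For reference, a minimal sketch (my own, not from the test) of how the small relative difference could be surfaced with an explicit tolerance check instead of `self.assertEqual`; `output` and `output_tp` stand for the single-node and tensor-parallel forward results:

```python
import torch

def compare_outputs(output: torch.Tensor, output_tp: torch.Tensor) -> None:
    # Hypothetical helper, not part of the test: report the relative gap
    # between the single-node and tensor-parallel forward results.
    rel_diff = (output - output_tp).abs().max() / output.abs().max().clamp_min(1e-12)
    print(f"max relative difference: {rel_diff.item():.3e}")
    # Fail when the gap exceeds explicit tolerances, rather than relying on
    # self.assertEqual's default comparison.
    torch.testing.assert_close(output_tp, output, rtol=1e-5, atol=1e-6)
```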
How to reproduce:
I created a branch `ad-hoc-self-attn-exp`, based on `origin/main`, with a bunch of print statements added to help reproduce the problem:
`git checkout origin/ad-hoc-self-attn-exp`
`pytest test/spmd/tensor/parallel/test_tp_examples.py -s -k test_self_attn_megatron_e2e`
Observation:
- Both modules produce output that grows from `-50` to `-inf` within 10 iterations (see the sketch after this list).
- The outputs of `output.sum()` and `output_tp.sum()` are not exactly identical; there is a relatively small numeric difference.
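To illustrate the first observation, here is a toy sketch (an assumption about the mechanism, not the actual test code) of how an unbounded `output.sum()` objective keeps getting pushed more negative by the optimizer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 8)          # stand-in for the attention module
opt = torch.optim.SGD(model.parameters(), lr=0.25)
x = torch.randn(4, 8)

for step in range(10):
    loss = model(x).sum()        # objective with no lower bound
    opt.zero_grad()
    loss.backward()
    opt.step()
    # The printed value grows more negative at every step; with enough
    # iterations (or a larger model / learning rate) it heads toward -inf.
    print(step, loss.item())
```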
Suggestion: use an MLE loss as the training objective instead of the raw sum of the output.
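As a hedged sketch of that suggestion (shapes, targets, and the helper name are illustrative assumptions, not taken from `test_tp_examples.py`), a likelihood-based objective such as negative log-likelihood is bounded below and cannot be driven to `-inf`:

```python
import torch
import torch.nn.functional as F

def mle_loss(output: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: treat the forward output as logits of shape
    # (batch, seq, vocab) and the target as class indices of shape (batch, seq).
    # cross_entropy is the negative log-likelihood of the targets, so the loss
    # is >= 0 and the training signal cannot run off to -inf.
    return F.cross_entropy(output.flatten(0, 1), target.flatten())

# Example usage with dummy data:
logits = torch.randn(2, 16, 100, requires_grad=True)
labels = torch.randint(0, 100, (2, 16))
loss = mle_loss(logits, labels)
loss.backward()
```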