Positional encoding potential size issue
I am working through one of the most complicated (and final) stages of building the whole Transformer toward the end of ch10. Tons of typos have been corrected, and now this line is raising an error:
encoded = scaled_x + self.pe[:,:x.size(1),:]
~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (2) must match the size of tensor b (6) at non-singleton dimension 2
This is from PositionalEncoding.forward(), and it fails in the full-sequence example that starts around page 433. The same PE class works fine in the minimal example given around p416, which I called the layernorm example (ch10-p416-layernorm-example.py), where the sizes are as follows:
full_train = torch.as_tensor(points).float()

PositionalEncoding.forward() entered...
['scaled_x'] : <class 'torch.Tensor'> torch.Size([3, 2, 4])
['self_pe'] : <class 'torch.Tensor'> torch.Size([1, 2, 4])
['self_pe_1'] : <class 'torch.Tensor'> torch.Size([1, 2, 4])
However, in the example from p433, as you can see, the sizes become the following during the decode stage (I did not see this in the encode stage):
PositionalEncoding.forward() entered...
['scaled_x'] : <class 'torch.Tensor'> torch.Size([16, 2, 2])
['self_pe'] : <class 'torch.Tensor'> torch.Size([1, 100, 6])
['self_pe_1'] : <class 'torch.Tensor'> torch.Size([1, 2, 6])
I see self_pe (which is actually self.pe) has taken the shape [1, 100, 6], probably due to the initialization in DecoderTransf with max_len=100 and d_model=6.
From the smaller example, I see the addition can take place as long as dimensions 2 and 3 are equal (dimension 1 apparently need not be); however, in the bigger example (full_seq) it seems to be failing because both dimensions 2 and 3 are different. I am still looking for the weak link; if I find it, I will close this issue. However, if there is a typo in the book, please let me know.
I see this did not happen during encode because scaled_x is [16, 2, 6] there, not [16, 2, 2] as in decode.
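To double-check the broadcasting rule, I reproduced both additions with dummy tensors of the same shapes as in my debug output (plain PyTorch only, nothing from the book's classes):

import torch

# p416 layernorm example: dims 1 and 2 match, dim 0 (3 vs. 1) broadcasts
small = torch.randn(3, 2, 4) + torch.randn(1, 2, 4)
print(small.shape)   # torch.Size([3, 2, 4])

# p433 decode stage: dim 2 is 2 vs. 6 and neither is 1, so no broadcasting
try:
    torch.randn(16, 2, 2) + torch.randn(1, 2, 6)
except RuntimeError as e:
    print(e)         # ... tensor a (2) must match ... tensor b (6) at non-singleton dimension 2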
$ egrep -irn "self_pe|entered|error|scaled_x" dbg2.log
6:DBG: EncoderLayer.init() entered...
16:DBG: MultiHeadAttention.init() entered.
23:DBG: MultiHeadAttention.init() entered.
30:DBG: MultiHeadAttention.init() entered.
39:DBG: EncoderTransf.forward() entered...
43:PositionalEncoding.forward() entered...
45:['scaled_x'] : <class 'torch.Tensor'> torch.Size([16, 2, 6])
48:['self_pe'] : <class 'torch.Tensor'> torch.Size([1, 100, 6])
51:['self_pe_1'] : <class 'torch.Tensor'> torch.Size([1, 2, 6])
53:DBG: EncoderLayer.forward() entered...
67:DBG: MultiHeadAttention.forward entered...
71:DBG: MultiHeadAttention.attn() entered...
75:DBG: MultiHeadAttention.score_function() entered...
99:DBG: EncoderLayer.forward() entered...
113:DBG: MultiHeadAttention.forward entered...
117:DBG: MultiHeadAttention.attn() entered...
121:DBG: MultiHeadAttention.score_function() entered...
145:DBG: DecoderTransf.forward() entered...
149:PositionalEncoding.forward() entered...
151:['scaled_x'] : <class 'torch.Tensor'> torch.Size([16, 2, 2])
154:['self_pe'] : <class 'torch.Tensor'> torch.Size([1, 100, 6])
157:['self_pe_1'] : <class 'torch.Tensor'> torch.Size([1, 2, 6])
199: encoded = scaled_x + self.pe[:,:x.size(1),:]
201:RuntimeError: The size of tensor a (2) must match the size of tensor b (6) at non-singleton dimension 2
I changed d_model from 6 to 2 and got rid of that error, but now I have run into a different one. Ohh, this is exhausting... :(
[nonroot@localhost ch10]$ git diff
 Layers
-enclayer = EncoderLayer(n_heads=3, d_model=6, ff_units=10, dropout=0.1)
-declayer = DecoderLayer(n_heads=3, d_model=6, ff_units=10, dropout=0.1)
+enclayer = EncoderLayer(n_heads=3, d_model=2, ff_units=10, dropout=0.1)
+declayer = DecoderLayer(n_heads=3, d_model=2, ff_units=10, dropout=0.1)

File "/home/nonroot/gg/git/codelab/gpu/ml/tf/tf-from-scratch/3/code-exercises/ch10/../common/models/nmha.py", line 43, in make_chunks
    x = x.view(batch_size, seq_len, self.n_heads, self.d_k)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[16, 2, 3, 0]' is invalid for input of size 64
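For the record, the new error reproduces in isolation: d_k is computed as d_model // n_heads, so with d_model=2 and 3 heads it comes out as 0 (this is just a standalone check with a dummy tensor, not the book's MultiHeadAttention class):

import torch

batch_size, seq_len, n_heads, d_model = 16, 2, 3, 2
d_k = d_model // n_heads                        # 2 // 3 == 0
x = torch.randn(batch_size, seq_len, d_model)   # 16 * 2 * 2 = 64 elements
# same reshape as make_chunks(), now asking for 16 * 2 * 3 * 0 = 0 elements
x = x.view(batch_size, seq_len, n_heads, d_k)
# RuntimeError: shape '[16, 2, 3, 0]' is invalid for input of size 64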
Hi @ggghamd ,
Apologies for the delayed response. I totally understand your frustration, and I agree that dealing with the shapes in the attention/Transformer is indeed exhausting. I also felt like that when I was trying to understand it for the first time.
I believe the issue you have is due to a missing transformation/projection. The input data (full_train) has only 2 features (the coordinates of the point), but PE operates at the same dimensionality as the model (6, in this case).
If you try the code below, it will raise the error you mentioned:
points, directions = generate_sequences(n=256, seed=13)
full_train = torch.as_tensor(points).float()
print(full_train.shape)
d_model=6
pe = PositionalEncoding(100, d_model)
pe(full_train).shape
Output
torch.Size([256, 4, 2])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
RuntimeError: The size of tensor a (2) must match the size of tensor b (6) at non-singleton dimension 2
You will find the missing transformation - the one that turns 2 features into 6 - in the EncoderDecoderTransf class, in the encode() and decode() methods:
def encode(self, source_seq, source_mask=None):
# Projection
source_proj = self.proj(source_seq) # --> you're missing this part somehow
encoder_states = self.encoder(source_proj, source_mask)
self.decoder.init_keys(encoder_states)
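The decode() method applies the same projection to the shifted target sequence before calling the decoder, which is why your traceback pointed at DecoderTransf.forward(). From memory, it looks roughly like this (the mask arguments and the name of the final linear layer may differ slightly from the book's listing):

def decode(self, shifted_target_seq, source_mask=None, target_mask=None):
    # Projection: n_features (2) -> d_model (6)
    target_proj = self.proj(shifted_target_seq)
    outputs = self.decoder(target_proj,
                           source_mask=source_mask,
                           target_mask=target_mask)
    # Map the d_model-dimensional outputs back to the feature space
    outputs = self.linear(outputs)
    return outputs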
The self.proj layer is a linear layer: proj = nn.Linear(n_features, encoder.d_model). If we include it in the example, it will work:
points, directions = generate_sequences(n=256, seed=13)
full_train = torch.as_tensor(points).float()
print(full_train.shape)
d_model=6
proj = nn.Linear(2, d_model)
projected = proj(full_train)
print(projected.shape)
pe = PositionalEncoding(100, d_model)
pe(projected).shape
Output
torch.Size([256, 4, 2])
torch.Size([256, 4, 6])
torch.Size([256, 4, 6])
Another detail: in your last comment, you modified the dimensionality of the model. It must always be a multiple of the number of heads. So, if we have 3 heads, we could never have 2 dimensions. That's why the projection is made into 6 dimensions.
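To make that concrete, here is the same reshape that failed in your make_chunks() traceback, but with d_model=6 (again just a standalone check with a dummy tensor):

import torch

batch_size, seq_len, n_heads, d_model = 16, 2, 3, 6
d_k = d_model // n_heads                        # 6 // 3 == 2
x = torch.randn(batch_size, seq_len, d_model)   # 16 * 2 * 6 = 192 elements
chunks = x.view(batch_size, seq_len, n_heads, d_k)
print(chunks.shape)                             # torch.Size([16, 2, 3, 2])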
I hope it helps! Let me know if you need anything else.
Best, Daniel
Thanks! It has been a while, so it may take me some time to re-tune into PE and look into your response. It was not frustrating, although indeed exhausting at times, and I am largely satisfied with your book. It has given me a deeper dive into the Transformer than any other literature comes close to. I think I have gotten a very good understanding of the attention mechanism; I may at most file 1-2 more issues in the near future. Good work!!