alerem18
> That's not the intended use for `Flux.train!`. This function is meant to iterate over an entire epoch, not a single batch. Try writing your loop as
>
> ```julia...
will there be any updates?
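For reference, the quoted suggestion amounts to an explicit epoch loop in which `Flux.train!` is called once per epoch and itself iterates over every batch. A minimal sketch using the explicit-style API of recent Flux versions, assuming a `model`, a `loss(m, x, y)` function, and a `train_loader` of `(x, y)` batches; none of these names come from this issue, and the quoted code above is truncated, so this is not necessarily the exact loop that was proposed:

```julia
using Flux

# Optimiser state for the explicit-style API.
opt_state = Flux.setup(Adam(1e-3), model)

for epoch in 1:10
    # One `train!` call per epoch: it loops over all batches in `train_loader`
    # and calls `loss(model, x, y)` for each `(x, y)` it yields.
    Flux.train!(loss, model, train_loader, opt_state)
end
```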
whatever it is, it's related to the backward pass; the forward pass in Flux is already faster than PyTorch, or at least the same speed
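One way to sanity-check that split is to time the forward pass and the full gradient separately. A minimal sketch using BenchmarkTools with a made-up model and loss, not the model from this issue:

```julia
using Flux, BenchmarkTools

model = Chain(Dense(128 => 256, relu), Dense(256 => 10))
x = rand(Float32, 128, 64)                       # (features, batch)
y = Flux.onehotbatch(rand(1:10, 64), 1:10)

loss(m, x, y) = Flux.logitcrossentropy(m(x), y)

@btime $model($x);                               # forward pass only
@btime Flux.gradient($loss, $model, $x, $y);     # forward + backward pass
```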
> The layer's documentation for the forward pass says:
>
> ```
> (mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])
> ...
> mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size)
> ```
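To make the documented shapes concrete, a call with an explicit (here all-`true`, so effectively no-op) mask might look like the sketch below; the sizes are made up for illustration, not taken from this issue:

```julia
using Flux

dim, q_len, kv_len, nheads, batch_size = 64, 8, 10, 4, 2
mha = MultiHeadAttention(dim; nheads)

q = rand(Float32, dim, q_len, batch_size)    # (q_in_dim, q_len, batch_size)
k = rand(Float32, dim, kv_len, batch_size)   # (k_in_dim, kv_len, batch_size)
v = rand(Float32, dim, kv_len, batch_size)   # (v_in_dim, kv_len, batch_size)

# Bool mask broadcastable to (kv_len, q_len, nheads, batch_size);
# entries that are `false` are set to -Inf before the softmax.
mask = trues(kv_len, q_len, 1, batch_size)

y, α = mha(q, k, v; mask)
size(y)   # (dim, q_len, batch_size)
size(α)   # (kv_len, q_len, nheads, batch_size)
```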
> @alerem18 which of the two reshaping is correct in your case?

reshape(mask, (seq_len, 1, 1, batch_size))
> > @alerem18 which of the two reshaping is correct in your case?
>
> reshape(mask, (seq_len, 1, 1, batch_size))

However, masking is wrong; it should be in the shape...
masking with shape (seq_len, 1, 1, batch_size) is OK, but with shape (1, seq_len, 1, batch_size) it returns NaN
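That is consistent with how the mask axes are interpreted: the first axis indexes keys (kv_len) and the second indexes queries (q_len). Reshaping a per-token padding mask of size (seq_len, batch_size) to (seq_len, 1, 1, batch_size) hides padded keys from every query, whereas (1, seq_len, 1, batch_size) masks out entire query rows, and a row whose scores are all -Inf comes back from the softmax as NaN. A small sketch (names and sizes are illustrative, not from this issue):

```julia
using Flux

dim, seq_len, nheads, batch_size = 32, 6, 4, 2
mha = MultiHeadAttention(dim; nheads)
x = rand(Float32, dim, seq_len, batch_size)

pad = trues(seq_len, batch_size)    # true = real token, false = padding
pad[end, :] .= false                # pretend the last position is padding

# Padding goes in the first (kv_len) dimension: padded *keys* are hidden.
key_mask = reshape(pad, seq_len, 1, 1, batch_size)
y, _ = mha(x, x, x; mask = key_mask)      # finite values

# Putting seq_len in the second (q_len) dimension masks whole query rows,
# so those rows softmax over all -Inf and produce NaN.
bad_mask = reshape(pad, 1, seq_len, 1, batch_size)
y2, _ = mha(x, x, x; mask = bad_mask)
any(isnan, y2)                            # true
```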
> I'm surprised this works at all with the input format given. What does the PyTorch code look like and have you verified it's doing the same thing?

what should...
PyTorch is quite different; it has a shape of (batch_size, seq_len, features). Also, I get much worse results by just reshaping the data differently: `@cast x_train[i][j, k] := DATA_TRAIN[1][i, j,...
> > PyTorch is quite different; it has a shape of (batch_size, seq_len, features)
>
> Flux supports something very similar. This is why it's important to see the PyTorch...
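Regarding the layout difference discussed above: converting between the two conventions is just a permutation of axes. A sketch with made-up sizes, going from PyTorch's (batch_size, seq_len, features) to the feature-first layout used by Flux's `MultiHeadAttention`; the `@cast` macro in the earlier comment is presumably TensorCast.jl's, but plain `permutedims` does the same reordering:

```julia
# PyTorch-style array: (batch_size, seq_len, features)
x_torch_layout = rand(Float32, 32, 10, 64)

# Feature-first layout expected by Flux's MultiHeadAttention:
# (features, seq_len, batch_size), i.e. the axes in reverse order.
x_flux_layout = permutedims(x_torch_layout, (3, 2, 1))
size(x_flux_layout)   # (64, 10, 32)
```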