Varuna Jayasiri
Looks like the number of tokens is different from the number of tokens used during training. Did you change the dataset or run BPE again?
This sounds like a bug. The dimensions of the embedding weight matrix are the number of tokens and the number of embedding features (`d_model`).
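For reference, a minimal sketch of why a changed vocabulary breaks checkpoint loading; the numbers and names here are just illustrative:

```python
import torch.nn as nn

n_tokens, d_model = 50_257, 768           # illustrative vocabulary and embedding sizes
emb = nn.Embedding(n_tokens, d_model)     # weight shape: [n_tokens, d_model]
print(emb.weight.shape)                   # torch.Size([50257, 768])

# A checkpoint saved with a different vocabulary size has an embedding weight
# with a different first dimension, so it can no longer be loaded into `emb`.
```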
I will give it a try and see if I can reproduce. Are you running the latest master? Did you make changes? Also is the dataset the same?
Thanks, you are right! That's a typo and a big bug!
This is strange. I guess the wrong softmax provides a non-linearity similar to the correct softmax, and gradient descent finds a way to use it. But I don't understand...
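A rough sketch of what I mean, assuming the softmax was applied over the wrong dimension of the attention scores (I'm guessing at the setup here):

```python
import torch
import torch.nn.functional as F

scores = torch.randn(4, 6)            # [n_queries, n_keys], illustrative shape

correct = F.softmax(scores, dim=-1)   # normalizes over keys: each query's weights sum to 1
wrong = F.softmax(scores, dim=0)      # normalizes over queries instead

print(correct.sum(dim=-1))            # tensor of ones, as attention expects
print(wrong.sum(dim=0))               # also ones, but along the wrong axis

# Both are smooth, bounded functions of the scores, which may be why
# gradient descent can still make some use of the wrong one.
```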
Should we do that, or `final_output[indexes_list[i], :] = expert_output[i].to(x.dtype)`? Because it seems like you changed the expert to `bfloat16` while the transformer's general processing was in `float32`, and you...
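A small sketch of what I have in mind, reusing the names from the snippet above; the surrounding setup is an assumption, not the actual code:

```python
import torch

x = torch.randn(8, 16)                      # transformer activations in float32
final_output = torch.zeros_like(x)          # keep the accumulator in the transformer's dtype

indexes_list = [torch.tensor([0, 2, 5]), torch.tensor([1, 3, 4, 6, 7])]
experts = [torch.nn.Linear(16, 16).bfloat16() for _ in range(len(indexes_list))]

# Each expert runs in bfloat16 on its share of the tokens
expert_output = [experts[i](x[indexes_list[i]].bfloat16()) for i in range(len(experts))]

for i in range(len(experts)):
    # Cast the bfloat16 result back to x.dtype before scattering it into final_output,
    # so everything downstream of the experts stays in float32
    final_output[indexes_list[i], :] = expert_output[i].to(x.dtype)
```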
The feature map size doesn't change. Can you please point to the comment that mentions it does? The [blocks are in a `nn.Sequential`](https://nn.labml.ai/resnet/index.html#section-63)
Yeah, that is correct. The feature map size stays the same. When I asked for the comment, I meant the one you said didn't match the code.
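A minimal sketch (not the actual code from nn.labml.ai) of what I mean by the feature map size staying the same across the blocks in the `nn.Sequential`:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A plain residual block with stride 1, so the spatial size is preserved."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.conv2(self.act(self.conv1(x))))

# Stack the blocks in an nn.Sequential, as the annotated implementation does
blocks = nn.Sequential(*[ResidualBlock(64) for _ in range(3)])

x = torch.randn(1, 64, 32, 32)
print(blocks(x).shape)   # torch.Size([1, 64, 32, 32]) – same feature map size
```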
We will try to add automatic translations.
We have machine-translated the comments into Chinese: https://nn.labml.ai/zh/