Rajarshee Mitra
Also, if it doesn't work for the initial setup I described, I would still be happy to have a solution for the scenario where all dimensions are known!
Building from source was a workaround for me.
Do you have any comparison of NMT performance with and without BERT?
Need attention, please!
It's kind of strange that the hidden states are, by default, not exposed :/
"I think it will use RNN hidden states as the logits, and argmax on the hidden state to try to get a word id." It looks very undesirable but then...
@oahziur I don't get your first point. How can we compute the hidden states of all steps in the first place without using the output layer and taking the argmax...
Yeah, I get your point. But if I am not using teacher forcing (i.e., using ```GreedyEmbeddingHelper```), I would want my predicted ids to be used. And for that to happen,...
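For reference, this is roughly how that decode path looks in the TF 1.x ```contrib.seq2seq``` API (a minimal sketch; the cell, sizes, and start/end token ids below are placeholder assumptions, not taken from my actual model). With ```GreedyEmbeddingHelper```, the projected output at each step is argmaxed into ```sample_id``` and fed back as the next input, and ```dynamic_decode``` only returns the projected ```rnn_output``` plus ```sample_id```, never the raw per-step hidden states:

```python
import tensorflow as tf
from tensorflow.contrib import seq2seq

# Placeholder model pieces -- assumed sizes and token ids for illustration.
vocab_size, embed_dim, hidden_dim, batch_size = 1000, 64, 128, 32
embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
cell = tf.nn.rnn_cell.GRUCell(hidden_dim)
projection = tf.layers.Dense(vocab_size, use_bias=False)

# Greedy inference: each step embeds the previous argmax prediction.
helper = seq2seq.GreedyEmbeddingHelper(
    embedding,
    start_tokens=tf.fill([batch_size], 1),  # assumed <sos> id
    end_token=2)                            # assumed <eos> id

decoder = seq2seq.BasicDecoder(
    cell, helper,
    initial_state=cell.zero_state(batch_size, tf.float32),
    output_layer=projection)

# rnn_output here is already projected to vocab logits; the raw GRU
# hidden states are consumed internally and never surfaced.
outputs, final_state, lengths = seq2seq.dynamic_decode(
    decoder, maximum_iterations=50)
logits, predicted_ids = outputs.rnn_output, outputs.sample_id
```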
So, I need to hack my way into it to use ```output_layer``` as part of the decoder and also make ```dynamic_decode``` return hidden states. Any suggestions about what...
This is what I did: added a new field ```final_output``` to the ```BasicDecoderOutput``` namedtuple that stores the projected outputs whenever there is an ```output_layer``` in ```BasicDecoder```. In the ```step()``` of ```BasicDecoder```,...
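For anyone who wants roughly the same effect without patching the TF source, a subclass can carry the extra field. This is only a sketch under TF 1.x: the names ```HiddenStateDecoder``` / ```HiddenStateDecoderOutput``` are made up here, and it leans on ```BasicDecoder```'s private attributes (```_cell```, ```_output_layer```, ```_helper```), which are internals and may change:

```python
import collections
import tensorflow as tf
from tensorflow.contrib.seq2seq import BasicDecoder

# Hypothetical output tuple: `rnn_output` keeps the raw cell output
# (the hidden state) and `final_output` the projected logits.
HiddenStateDecoderOutput = collections.namedtuple(
    "HiddenStateDecoderOutput", ("rnn_output", "final_output", "sample_id"))

class HiddenStateDecoder(BasicDecoder):
    """BasicDecoder variant that also surfaces pre-projection outputs."""

    @property
    def output_size(self):
        base = super(HiddenStateDecoder, self).output_size
        return HiddenStateDecoderOutput(
            rnn_output=self._cell.output_size,  # raw hidden state size
            final_output=base.rnn_output,       # projected (vocab) size
            sample_id=base.sample_id)

    @property
    def output_dtype(self):
        base = super(HiddenStateDecoder, self).output_dtype
        return HiddenStateDecoderOutput(
            rnn_output=tf.float32,  # assumes float32 cell states
            final_output=base.rnn_output,
            sample_id=base.sample_id)

    def step(self, time, inputs, state, name=None):
        cell_outputs, cell_state = self._cell(inputs, state)
        raw_outputs = cell_outputs  # keep the hidden state before projection
        if self._output_layer is not None:
            cell_outputs = self._output_layer(cell_outputs)
        sample_ids = self._helper.sample(
            time=time, outputs=cell_outputs, state=cell_state)
        finished, next_inputs, next_state = self._helper.next_inputs(
            time=time, outputs=cell_outputs, state=cell_state,
            sample_ids=sample_ids)
        outputs = HiddenStateDecoderOutput(raw_outputs, cell_outputs, sample_ids)
        return outputs, next_state, next_inputs, finished
```

Since ```dynamic_decode``` stacks whatever structure ```step()``` returns (sized by ```output_size```/```output_dtype```), the final outputs then expose ```outputs.rnn_output``` (hidden states) alongside ```outputs.final_output``` (logits) without any further changes.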