Grégory Châtel
Hi @sharpsy, I also think that this comment would help people understand this code. Do you want to add it yourself or would you like me to create the pull...
The first `n_vocab + n_special` indices of the embedding matrix are already taken by the vocabulary words (`n_vocab` of them) and the special tokens such as `_start_`, `_delimiter_` and `_classify_`.
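For reference, here is a quick sketch of the layout I mean; the numbers are those of the OpenAI setup, but treat them as illustrative rather than as the repo's exact constants:

```python
# Hedged sketch of the embedding-matrix layout (illustrative values).
n_vocab, n_special, n_ctx = 40478, 3, 512

# rows 0 .. n_vocab - 1                    -> BPE vocabulary words
# rows n_vocab .. n_vocab + n_special - 1  -> _start_, _delimiter_, _classify_
# rows n_vocab + n_special .. end          -> learned position embeddings
total_rows = n_vocab + n_special + n_ctx
first_position_row = n_vocab + n_special
print(total_rows, first_position_row)  # 40993 40481
```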
I do not think that you can change the word embedding easily, since its dimension must be the same as the output dimension of each layer; in the case of the...
The idea of the OpenAI paper is to use a pretrained network and [transfer](https://en.wikipedia.org/wiki/Transfer_learning) what it knows about language to another task. By doing this, you can obtain really good...
Hi, in the article the authors use the transpose of the embedding matrix as the linear layer just before the softmax layer. This explains the shape of the softmax layer.
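To make that concrete, here is a minimal PyTorch sketch of the weight tying, assuming a standard `nn.Embedding`; the names are illustrative, not the ones used in this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_vocab, n_embd = 40478, 768
embed = nn.Embedding(n_vocab, n_embd)

def lm_logits(hidden_states):
    # hidden_states: (batch, seq_len, n_embd), output of the last block.
    # Reusing the embedding matrix (transposed) as the pre-softmax projection,
    # so the logits have one entry per vocabulary word.
    return hidden_states @ embed.weight.t()

h = torch.randn(2, 5, n_embd)
probs = F.softmax(lm_logits(h), dim=-1)
print(probs.shape)  # torch.Size([2, 5, 40478])
```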
As mentioned in #24, the inference head has not been tested yet. I would love to work with you on it. Could you post a bigger piece of code showing...
The pre-trained network is only available in English for now. Multiple issues are open about this in the OpenAI repo ([#2](https://github.com/openai/finetune-transformer-lm/issues/2) and [#20](https://github.com/openai/finetune-transformer-lm/issues/20)) but no response so far from the...
Hi, @artemisart is correct.
There you go (@thomwolf correct me if I'm wrong on any of these):
- `n_ctx` is the maximum number of tokens in an input sequence.
- `n_special` is the number...
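As a rough illustration of how these fit together when preparing a classification input (the helper below is only a sketch with assumed names and values, not the repo's own preprocessing):

```python
# Illustrative values, roughly those of the OpenAI setup.
n_vocab, n_special, n_ctx = 40478, 3, 512
start, delimiter, classify = n_vocab, n_vocab + 1, n_vocab + 2

def encode_for_classification(bpe_tokens):
    """Wrap one BPE-encoded sequence with _start_ and _classify_ tokens,
    truncating it so the result never exceeds n_ctx positions
    (_delimiter_ would only be needed for two-sequence tasks)."""
    body = bpe_tokens[: n_ctx - 2]
    return [start] + body + [classify]

print(len(encode_for_classification(list(range(1000)))))  # 512
```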