Max Ma
This initialization ensures that the posterior distribution starts as a normal distribution. During training, the posterior becomes more and more complex as we update the parameters.
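To illustrate the idea (a sketch with assumed details, not the author's code): parameterize the posterior as a base normal pushed through a learnable transform, and zero-initialize the transform's parameters so it is the identity at the start. The posterior is then exactly normal at initialization and only grows more complex as training updates the parameters.

```python
import math

class AffinePosterior:
    """Illustrative posterior: base normal sample pushed through a
    learnable affine transform (names here are hypothetical)."""

    def __init__(self):
        self.log_scale = 0.0  # zero-init => scale of exp(0) = 1
        self.shift = 0.0      # zero-init => no shift

    def transform(self, z):
        # At initialization this is the identity map, so samples
        # keep the base normal distribution.
        return z * math.exp(self.log_scale) + self.shift

flow = AffinePosterior()
```

Once training perturbs `log_scale` and `shift` (or, in a real flow, many stacked nonlinear layers), the pushed-forward distribution is no longer the initial normal.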
Hi, for a "large" model with around 3 billion parameters, I guess the optimizer is probably not the memory bottleneck compared with the gradient calculation in back-propagation. Can I ask...
Thanks for the update! If I understand correctly, storing the parameters together with the optimizer states is indeed the memory bottleneck. Since apollo has one more state (3 vs....
Please let me know if you find apollo obtains better results on the large model. Thanks!
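A rough back-of-the-envelope estimate of where the memory goes in this exchange (a sketch assuming fp32, i.e. 4 bytes per value, counting only parameters, gradients, and per-parameter optimizer states, and ignoring activations):

```python
def memory_gb(n_params, n_states, bytes_per_value=4):
    """Return (params, gradients, optimizer states) memory in GB,
    assuming one gradient value and `n_states` optimizer-state
    values per parameter."""
    gb = 1024 ** 3
    params = n_params * bytes_per_value / gb
    grads = n_params * bytes_per_value / gb   # same size as params
    states = n_params * n_states * bytes_per_value / gb
    return params, grads, states

# e.g. a 3B-parameter model with an optimizer keeping 3 states
p, g, s = memory_gb(3e9, n_states=3)
```

With 3 states per parameter, the optimizer states alone cost three times the parameter memory, which is why parameters plus optimizer states dominate over the single gradient buffer.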
Thanks a lot for your reply. It is pretty clear! On Wed, Dec 5, 2018 at 8:20 AM Huadong Liao wrote: > Hope my explanation will help you: > >...
"DOCSTART" in my data sets is placed in a separated sentence, like 1 -DOCSTART- -X- O O But as it provide no useful information, you can remove it from your...
@nrasiwas sorry for the late response. Here is a clearer example of the data format. The following is the correct format for your examples: 1 EU NNP I-NP I-ORG 2...
The second column is reserved for the lemma, the same as in CoNLL-U. But our model does not use lemma information, so the second column can be filled with anything. Our...
Hi, the data is under the PTB license. If that is not an issue, I am happy to send you the data. Can you give me your email?
For CoNLL-X format, the schema is: ID, FORM, LEMMA, CPOSTAG, POSTAG, MORPH-FEATURES, HEAD, DEPREL, PHEAD, PDEPREL.
For NER data, the schema is: ID, FORM, POSTAG, CHUNK, NERTAG.
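A minimal parser for the NER schema described in this thread could look like the sketch below. It assumes whitespace-separated columns (ID, FORM, POSTAG, CHUNK, NERTAG), blank lines between sentences, and -DOCSTART- lines to be skipped, as discussed above; the function name and dict keys are my own.

```python
def read_conll_ner(lines):
    """Parse NER data in the 5-column format (ID FORM POSTAG CHUNK NERTAG)
    into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines end a sentence; -DOCSTART- carries no information,
        # so it is dropped (per the comment above).
        if not line or line.split()[1] == "-DOCSTART-":
            if current:
                sentences.append(current)
                current = []
            continue
        idx, form, pos, chunk, ner = line.split()
        current.append({"id": int(idx), "form": form, "pos": pos,
                        "chunk": chunk, "ner": ner})
    if current:
        sentences.append(current)
    return sentences

sample = ["1 EU NNP I-NP I-ORG", "2 rejects VBZ I-VP O", "",
          "1 -DOCSTART- -X- O O", "",
          "1 Peter NNP I-NP I-PER"]
parsed = read_conll_ner(sample)
```

Since the model ignores the lemma column in the CoNLL-X layout, the same approach works there: split on whitespace and keep only the columns you need.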