Shuai Zheng
@tjruwase DeepSpeed initializes the FP32 master weights in the engine when `deepspeed.initialize` is called. After that, we use DeepSpeed's `load_checkpoint` to load the model weights without providing the ZeRO state...
@tjruwase Yes. There are three reasons: 1) we may add additional layers during finetuning, in which case the shapes of the partitions do not align and loading fails...
@tjruwase This is what I originally did to make it work. But I think it is frustrating if `load_checkpoint` cannot handle such a case.
@tjruwase There is one more possible bug we found yesterday: ZeRO-2 gives much higher accuracy than ZeRO-1 in finetuning (all the hyperparameters are the same except...
@tjruwase I will see if I can reproduce the ZeRO-1 regression with a publicly available dataset. Yes, that description captures my case. Also, I would like to draw your attention to my...
@liuzh91 Once we have the mining tool, we can also extract data from different domains and use it for research.
@pengxin99 Yes, the gradient needs to be averaged by the total number of tokens across all the GPUs, not by the per-GPU token count.
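A minimal sketch of why the normalization matters, with made-up per-GPU numbers (not from any real run): dividing each GPU's gradient by its local token count and then averaging gives a different (biased) result than dividing the all-reduced gradient sum by the global token count.

```python
# Toy illustration: two "GPUs" with different numbers of valid (non-padding)
# tokens, each holding the unnormalized sum of its per-token gradients.
per_gpu_tokens = [2, 8]          # valid tokens on each GPU (hypothetical)
per_gpu_grad_sums = [4.0, 8.0]   # sum of per-token gradients on each GPU

# Wrong: normalize locally by each GPU's own token count, then average.
# Tokens on the small batch get weighted more heavily.
wrong = sum(g / n for g, n in zip(per_gpu_grad_sums, per_gpu_tokens)) / len(per_gpu_tokens)

# Right: (conceptually) all-reduce both the gradient sums and the token
# counts, then divide by the global total so every token counts equally.
total_tokens = sum(per_gpu_tokens)
right = sum(per_gpu_grad_sums) / total_tokens

print(wrong)  # 1.5
print(right)  # 1.2
```

In a real multi-GPU setup the two `sum(...)` calls would be all-reduce operations, but the arithmetic is the same.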
@sxjscience It seems the error `AttributeError: module 'mxnet.ndarray.numpy_extension' has no attribute 'sldwin_atten_score'` is because the installed MXNet version is not the latest.
I think GluonNLP's vocab has a different token order from HuggingFace's, so it will be problematic if we simply copy the entire embedding matrix. @eric-haibin-lin
I think @eric-haibin-lin means that `embedding_gluonnlp[2] != embedding_hf[101]`, since the embedding matrix was copied without reordering.
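A minimal sketch of the reordering step being discussed, using tiny made-up vocabularies (the token-to-index mappings below are hypothetical, not the real GluonNLP or HuggingFace ones): for each token in the target vocab, copy the row from the source matrix at that token's source index, rather than copying rows positionally.

```python
import numpy as np

# Hypothetical vocabularies: the same tokens, indexed differently.
hf_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102}
gluon_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}

embed_dim = 4
rng = np.random.default_rng(0)
embedding_hf = rng.normal(size=(103, embed_dim))  # source embedding matrix

# Reorder: place each token's HF row at its GluonNLP index.
embedding_gluonnlp = np.zeros((len(gluon_vocab), embed_dim))
for token, g_idx in gluon_vocab.items():
    embedding_gluonnlp[g_idx] = embedding_hf[hf_vocab[token]]

# GluonNLP index 2 ([CLS]) now holds the row HF stores at index 101;
# a positional copy of the first rows would have gotten this wrong.
assert np.allclose(embedding_gluonnlp[2], embedding_hf[101])
```

The same loop works for any pair of vocabs, with a fallback (e.g. random init) for tokens missing from the source.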