albert_pytorch
A Lite BERT for Self-Supervised Learning of Language Representations

In the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes", the learning rate depends on the batch size. However, I find that the learning rate is also...
Hello: I train a QA model with the same data pipeline in both cases. With bert-wwm I can set the batch size to 12, but with albert-xxlarge-v2 I can only go up to 6. Yet the albert-xxlarge-v2 checkpoint file is only about 900 MB, while the bert-wwm checkpoint is 1400 MB. What could possibly cause this?
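A plausible explanation for the question above: ALBERT shares one set of transformer weights across all layers, so its checkpoint is small, but GPU memory during training is dominated by per-layer activations, which scale with the hidden and feed-forward sizes rather than with the parameter count. A very coarse sketch, using the published config sizes (24 layers / hidden 1024 / FFN 4096 for a large BERT; 12 layers / hidden 4096 / FFN 16384 for albert-xxlarge-v2) and assuming sequence length 384 with float32 activations; attention maps and optimizer states are ignored:

```python
def per_example_activation_mib(num_layers, hidden, intermediate, seq_len):
    """Coarse lower bound: one hidden-state tensor plus one FFN tensor
    per layer, per example, in float32 (4 bytes)."""
    return num_layers * seq_len * (hidden + intermediate) * 4 / 2**20

# Assumed configs, not measured values:
bert_large = per_example_activation_mib(24, 1024, 4096, 384)
albert_xxl = per_example_activation_mib(12, 4096, 16384, 384)

print(f"large BERT, per example:        {bert_large:.0f} MiB")
print(f"albert-xxlarge-v2, per example: {albert_xxl:.0f} MiB")
```

By this rough count albert-xxlarge-v2 needs about twice the activation memory per example, which is consistent with only fitting half the batch size despite the smaller file on disk.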
INFO:model.modeling_albert_bright:Initialize PyTorch weight ['bert', 'embeddings', 'position_embeddings']
INFO:model.modeling_albert_bright:Skipping bert/embeddings/position_embeddings/lamb_m
Traceback (most recent call last):
  File "convert_albert_tf_checkpoint_to_pytorch.py", line 59, in
    args.pytorch_dump_path)
  File "convert_albert_tf_checkpoint_to_pytorch.py", line 34, in convert_tf_checkpoint_to_pytorch
    load_tf_weights_in_albert(model, config, tf_checkpoint_path)
  File "/data/albert_pytorch-master/model/modeling_albert_bright.py",...
Hello, I am trying to continue pretraining albert_small on a financial corpus, starting from your pretrained albert_small weights. The problem: after training on 100k financial documents, model accuracy stops improving and the loss stops decreasing, even with further training. With the original learning rate (0.000176) training diverges; I have lowered it to 1e-5 and 1e-6, but learning still stalls. My albert_small results so far: training accuracies of only 57 and 68. 1) Could you share the training results you got for albert_small? 2) Can you share any experience on improving pretraining quality?
https://github.com/lonePatient/albert_pytorch/blob/e9dbe3ce9aa49e787774b050cbdc496046e0c5bf/run_classifier.py#L110-L122
The above is run_classifier.py, lines 110-122. If `args.gradient_accumulation_steps` keeps its default value of 1, there is no problem. But when it is set to another value, say `4`, the first 3 iterations of the inner loop (step = 0 through 2) fail the `if` check at line 110, so `global_step` stays at 0. The check at line 116 then almost always passes (because with `global_step = 0`, `global_step % args.logging_steps == 0` holds unconditionally), so `evaluate` runs 3 pointless times before a single gradient update has happened. So there seems to be a small flaw here. My understanding is that `global_step` should stay consistent with `num_training_steps` at line 78, ```logger.info(" Total optimization steps = %d", num_training_steps)```: `global_step` should only be incremented once per actual gradient update, i.e. once per effective batch, which is also why `train()` ultimately returns loss = `tr_loss / global_step`. So I wonder whether simply adding the condition `global_step != 0` to the checks at lines 116 and 121 would be enough to work around the problem for now.
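The proposed guard can be sketched with a simplified, hypothetical version of the loop (only the counters are kept; `evaluations` stands in for the `evaluate(...)` call in the real script):

```python
def train(num_batches, accum_steps, logging_steps, guard):
    """Simplified gradient-accumulation loop mirroring the structure of
    run_classifier.py lines 110-122. `guard` toggles the proposed fix."""
    global_step = 0
    evaluations = 0
    for step in range(num_batches):
        # ... forward / backward on one micro-batch would happen here ...
        if (step + 1) % accum_steps == 0:
            # optimizer.step(); scheduler.step(); model.zero_grad()
            global_step += 1  # one real optimizer update
        # Without the guard, `0 % logging_steps == 0` holds before the first
        # update, so evaluation fires on every micro-batch while global_step
        # is still 0.
        if (not guard or global_step != 0) and global_step % logging_steps == 0:
            evaluations += 1  # stands in for evaluate(args, model, tokenizer)
    return global_step, evaluations

print(train(8, 4, 1, guard=False))  # evaluates even before any update
print(train(8, 4, 1, guard=True))   # no evaluation until global_step > 0
```

With 8 micro-batches and accumulation 4 there are only 2 real updates; the unguarded version evaluates on all 8 micro-batches, the guarded one only once an update has occurred.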
RuntimeError: Error(s) in loading state_dict for BertModel:
size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([21128, 128]) from checkpoint, the shape in current model is torch.Size([21128, 312]).
========================
Can...
Hi~ I want to use _AlbertForPreTraining_ to do a MaskedLM task on new datasets (with a **new vocab.txt**, whose size is not 21128) **based on the pretrained weights**. How can I do that...
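One common approach to the question above is to keep the pretrained weights and only rebuild the word-embedding matrix for the new vocabulary. A minimal sketch in plain PyTorch, with hypothetical sizes (the real checkpoint here has 21128 rows); it assumes tokens shared between the old and new vocab keep the same ids, otherwise you would copy rows by matching the token strings instead:

```python
import torch
import torch.nn as nn

old_vocab, new_vocab, dim = 21128, 25000, 128
old_emb = nn.Embedding(old_vocab, dim)       # stands in for the loaded pretrained weights

new_emb = nn.Embedding(new_vocab, dim)
nn.init.normal_(new_emb.weight, std=0.02)    # BERT-style init for brand-new tokens
with torch.no_grad():
    n = min(old_vocab, new_vocab)
    new_emb.weight[:n] = old_emb.weight[:n]  # reuse rows for overlapping token ids

print(new_emb.weight.shape)  # torch.Size([25000, 128])
```

If the model class follows the Hugging Face `PreTrainedModel` API, `model.resize_token_embeddings(new_size)` does essentially this (including the tied output layer); whether this repo's model class exposes that method would need to be checked.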
I know the BERT model can handle the CLOTH test, but I want to use the ALBERT model for the CLOTH test. I would appreciate any help.
The parameters of project_layer should correspond to the parameters of bert.embeddings.word_embeddings_2.