treelstm.pytorch
Two differences from the original implementation
I got the same result as you, ~0.846 Pearson score. After checking the original implementation, I found two differences.
- In your trainer.py file,
    def train(self, dataset):
        self.model.train()
        self.optimizer.zero_grad()
        loss, k = 0.0, 0
        indices = torch.randperm(len(dataset))
        for idx in tqdm(range(len(dataset)),desc='Training epoch '+str(self.epoch+1)+''):
            ltree,lsent,rtree,rsent,label = dataset[indices[idx]]
            linput, rinput = Var(lsent), Var(rsent)
            target = Var(map_label_to_target(label,dataset.num_classes))
            if self.args.cuda:
                linput, rinput = linput.cuda(), rinput.cuda()
                target = target.cuda()
            output = self.model(ltree,linput,rtree,rinput)
            err = self.criterion(output, target)
            loss += err.data[0]
            err.backward() # <------------
            k += 1
            if k%self.args.batchsize==0:
                self.optimizer.step()
                self.optimizer.zero_grad()
        self.epoch += 1
        return loss/len(dataset)
You call .backward() for each sample in the mini-batch and then perform a single update with self.optimizer.step(). Since backward() accumulates gradients automatically, it seems you need to average both the losses and the gradients over the mini-batch. So I think the line marked with the arrow above should be changed to
(err/self.args.batchsize).backward()
- The original implementation does not really update the embeddings. It does not include the embedding parameters in the model, and all the parameters of the model are optimized with Adagrad. It updates the embedding parameters directly with gradients*learning_rate, but that learning_rate is set to 0. Furthermore, I did some simple calculations. There are more than 700000 embedding parameters and 286505 other model parameters. Considering that the size of the training set is just 4500, it is too small to fine-tune the embeddings.
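The first point above (scaling each per-sample loss before calling backward()) can be illustrated without any framework. This is only a minimal sketch: the one-parameter model, the `grad_of_loss` helper and the toy batch are made up for illustration and stand in for what autograd does when gradients accumulate across repeated backward() calls.

```python
# Framework-free sketch: a one-parameter model y = w * x with squared error.
# `grad_of_loss` is a hypothetical stand-in for what autograd would compute.


def grad_of_loss(w, x, target):
    """Gradient of (w * x - target) ** 2 with respect to w."""
    return 2.0 * (w * x - target) * x


def accumulate(w, batch, scale):
    """Mimic repeated .backward() calls: per-sample gradients add up."""
    grad = 0.0
    for x, target in batch:
        grad += scale * grad_of_loss(w, x, target)
    return grad


# Made-up (input, target) pairs forming one mini-batch.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 6.0)]
w = 0.5

# Corresponds to err.backward(): the accumulated gradient is a sum.
summed = accumulate(w, batch, scale=1.0)
# Corresponds to (err / batchsize).backward(): the accumulated gradient is a mean.
averaged = accumulate(w, batch, scale=1.0 / len(batch))

print("summed gradient:  ", summed)
print("averaged gradient:", averaged)
```

The two accumulated gradients differ exactly by a factor of the batch size, which is the discrepancy described in the first point.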
After I made the two modifications above, I get a Pearson score of 0.854 and an MSE of 0.274 with Adagrad (learning_rate=0.05).
Yes, in Section 5.3 the paper says:
For the semantic relatedness task, word representations were held fixed as we did not observe any significant improvement when the representations were tuned
Hi @wangxin0716 and @ryh95 ,
As you have pointed out, the original paper mentions freezing the word embeddings. I had overlooked this, but have rectified my mistake and incorporated it via a commit which adds the option of freezing the word embeddings during training. This results in a slight improvement to the metrics, and we can now reach a Pearson's coefficient of 0.8674 and an MSE of 0.2536.
We are now within ~0.0005 of the original paper, albeit with a different learning rate, so I do not really know if there is any way left to exactly match the numbers. Different libraries, platforms, OS, etc. might account for numerical precision differences within this ballpark.
BTW, @wangxin0716, I also tried the change you suggested, i.e. (err/self.args.batchsize).backward(); however, I ended up getting better final metrics keeping it as is. I believe this should not matter much, since it is a simple scaling of the gradient, and the same effect can be achieved with a different learning rate.
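How much a constant gradient scale matters depends on the optimizer. The sketch below is a rough, framework-free illustration, assuming plain SGD and the textbook Adagrad update with no weight decay or learning-rate decay. The function names and the gradient values are made up. Under these assumptions, an SGD step shrinks by exactly the scale factor (so a larger learning rate compensates), while the Adagrad step is nearly unchanged because the scale cancels against the accumulated squared gradients. If weight decay is added to the gradient before the update, as with --wd, scaling the loss would also change the relative strength of the penalty, which this sketch ignores.

```python
import math


def sgd_step(grad, lr):
    """Parameter change for one plain SGD update."""
    return -lr * grad


def adagrad_steps(grads, lr, eps=1e-10):
    """Parameter changes for a sequence of textbook Adagrad updates."""
    state_sum = 0.0
    steps = []
    for g in grads:
        state_sum += g * g  # running sum of squared gradients
        steps.append(-lr * g / (math.sqrt(state_sum) + eps))
    return steps


# Made-up gradient sequence for a single parameter.
grads = [0.8, -1.5, 0.3, 2.0]
batchsize = 25
scaled = [g / batchsize for g in grads]  # effect of (err / batchsize).backward()

# SGD: the step is scaled by exactly 1 / batchsize.
sgd_ratio = sgd_step(scaled[0], lr=0.05) / sgd_step(grads[0], lr=0.05)

# Adagrad: the steps are almost identical with and without the scaling.
adagrad_plain = adagrad_steps(grads, lr=0.05)
adagrad_scaled = adagrad_steps(scaled, lr=0.05)
max_diff = max(abs(a - b) for a, b in zip(adagrad_plain, adagrad_scaled))

print("SGD step ratio:", sgd_ratio)
print("largest Adagrad step difference:", max_diff)
```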
I ran with the parameters --lr 0.025 --wd 0.0001 --optim adagrad --batchsize 25 --freeze_embed; however, the result is 0.857, about 0.01 less than what it is supposed to be. What could have caused this?
Thanks for the code. It was very helpful in understanding the paper. I ran the code with the following configuration:
Namespace(batchsize=25, cuda=False, epochs=50, expname='master', freeze_embed=True, hidden_dim=50, input_dim=300, lr=0.025, mem_dim=150, num_classes=5, optim='adagrad', save='checkpoints/', seed=123, sparse=False, wd=0.0001)
and got the best result at the 5th epoch:
Epoch 5, Test Loss: 0.10324564972114664 Pearson: 0.8587949275970459 MSE: 0.2709934413433075
which is less than what is claimed. Could you please suggest what I could be doing wrong? Is anyone else facing the same issue? Thanks
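For anyone comparing numbers across runs, it may help to check that the two reported metrics are computed the same way. The following is a minimal plain-Python sketch of Pearson correlation and mean squared error. It is not the repository's metrics code, and the `predicted` and `gold` scores are made-up values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(xs, ys):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up relatedness scores on a 1-5 scale, for illustration only.
predicted = [1.2, 3.4, 4.8, 2.9]
gold = [1.0, 3.5, 5.0, 2.5]

r = pearson(predicted, gold)
error = mse(predicted, gold)
print("Pearson:", r, "MSE:", error)
```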