GraphWriter
Exact command line arguments to reproduce results
I have followed the instructions as they are listed in the README, i.e., I ran the following commands verbatim in the root folder of the repository (please correct me if I misunderstood anything):
python3 ./train.py -save trained_weights
python3 generator.py -save=trained_weights/11.vloss-3.562062.lr-0.1
mkdir ../outputs
python3 eval.py ../outputs/11.vloss-3.562062.lr-0.1.inputs.beam_predictions.cmdline data/preprocessed.test.tsv
It got me the following output:
Bleu_1: 17.554950053967314
Bleu_2: 9.937139335974187
Bleu_3: 5.843996614371239
Bleu_4: 3.4978054776396585
METEOR: 7.639675023911272
ROUGE_L: 15.840991976680046
As this is considerably worse than the results reported in the paper, I assume I have missed something. As others have reported in other issues that they were able to reproduce the results, could someone please post their exact command line arguments for doing so? I suppose the default learning rate (0.1) is wrong, as 0.25 was reported in the paper. However, if the defaults are not the optimal hyperparameters, I am unsure how to reproduce the rest of the reported training regime, e.g., the exact schedule by which the learning rate goes down to 0.05 over 5 epochs.
Some other thoughts I had:
- In command (2) I chose epoch 11 because of the smallest validation loss. I understand that this was the procedure for model selection in the paper. Am I wrong here?
- Do I have to set a specific random seed to reproduce results?
I got the same results as the author. It seems that you did not use the title, which should be added as an argument (see pargs) on the command line. You should use the result of epoch 20 rather than epoch 11, and do not change any parameter except the save location. A lower validation loss does not necessarily mean better results.
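For example, training with the title enabled would look something like this (same save directory as above, no other flags changed):
python3 ./train.py -save trained_weights -title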
Hi @wutaiqiang , thank you for answering!
If I understand the code correctly, adding -title as an argument would not use the title as input but only the graph/entities. Did I get that wrong?
I assumed that model selection was based on validation loss because this is the only feature visible from the weight file names. Thank you for the hint! Here are the BLEU scores when I use the weights from epoch 20:
Bleu_1: 13.81998084017321
Bleu_2: 7.975161909290924
Bleu_3: 4.785343961925754
Bleu_4: 2.9152031333545163
Unfortunately, they're worse than what I got with epoch 11.
Could you please post what you typed in exactly when you obtained the same results as reported in the paper? It would be very helpful to have a complete documentation of every step necessary to obtain the good results.
Code: https://github.com/rikdz/GraphWriter/blob/3a3a990a230e01004cccfa32e809cd68ebd58288/pargs.py#L26 Adding -title as an argument WOULD use the title as input. Here is my result:
Bleu_1/2/3/4: 42.2755 / 27.9623 / 19.6838 / 14.0688
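For anyone else unsure what the flag does, here is a small self-contained illustration of how a store_true argument behaves in argparse; this is only a sketch of the standard behaviour, not the exact code from pargs.py:
import argparse

# A flag declared with action='store_true' defaults to False and is set
# to True only when it is passed on the command line, so passing -title
# turns the title input ON.
parser = argparse.ArgumentParser()
parser.add_argument("-title", action='store_true',
                    help="use the title as an additional input")
print(parser.parse_args([]).title)           # False
print(parser.parse_args(["-title"]).title)   # True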
The code that you quoted actually made me believe that adding -title would, as the help text says, not use the title as input, only the graph/entities. Thank you for pointing that out ^^
Unfortunately, I now get a CUDNN_STATUS_EXECUTION_FAILED error with -title. I suppose I am running out of GPU memory, so there is not really anything I can do about it.
Thanks again for your help trying to reproduce the results!
Maybe you can adjust the batch size; it uses approximately 10 GB of GPU memory.
I tried -bsz 16, but that did not help. And I'm also afraid that altering the batch size could worsen the final results. Normally, I have nearly 11 GB of GPU memory. I am not sure what else could be the reason for the error...
My parameters:
parser.add_argument("-t1size", default=24, type=int, help="batch size for short targets")
parser.add_argument("-t2size", default=16, type=int, help="batch size for medium length targets")
parser.add_argument("-t3size", default=6, type=int, help="batch size for long targets")
The '-bsz' option is not used during training; you should adjust t1size~t3size instead.
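For example, roughly halving the bucketed batch sizes would look something like this (the -t1size/-t2size/-t3size flags are the ones from pargs.py quoted above; the values here are only a guess for an 11 GB card):
python3 ./train.py -save trained_weights -title -t1size 12 -t2size 8 -t3size 3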
CUDNN_STATUS_EXECUTION_FAILED usually means the CUDA version and the Python packages (torchtext etc.) are not compatible; maybe you can use Anaconda and reinstall your environment.
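A fresh environment can be set up roughly like this; no versions are pinned here because they depend on your CUDA driver, so treat it only as a sketch rather than the repository's documented setup:
conda create -n graphwriter
conda activate graphwriter
conda install python pytorch cudatoolkit -c pytorch
pip install torchtext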
Thank you! This is very helpful. I will try this!
Hi @wutaiqiang, have you compared the performance with and without -title? Indeed, in my experiment I found that without -title I get a higher BLEU score (14.37), which is close to the paper, while with -title I get a lower BLEU score (13.+). I want to confirm this result because the argument also got me confused.
As for the original issue, I don't think the -title setting alone could lead to a 10-point BLEU decrease. @mnschmit, have you changed the default learning rate? Since your performance at epoch 20 is worse than at epoch 11, it sounds like you may have switched to a higher learning rate (e.g. 0.25), which could cause overfitting.
I reached the conclusion that '-title means using the title' by reading the code rather than by comparing results, e.g.:
https://github.com/rikdz/GraphWriter/blob/3a3a990a230e01004cccfa32e809cd68ebd58288/models/newmodel.py#L34-L36 https://github.com/rikdz/GraphWriter/blob/3a3a990a230e01004cccfa32e809cd68ebd58288/models/newmodel.py#L65-L67
I ran the experiment just now, using the epoch 20 results for the model with '-title' and without '-title'. Without '-title':
BLEU: 14.2996 METEOR: 18.7
with '-title':
BLEU: 14.0688 METEOR: 18.8525
Hard to believe the result.
I got similar results to yours. I used the '-title' setting and didn't change the learning rate; it was always 0.1. At epoch 20 the output was as follows:
bleu_1: 20.36
bleu_2: 9.75
bleu_3: 5.03
bleu_4: 2.77
METEOR: 6.04
ROUGE_L: 13.65
At epoch 11 the output was as follows:
bleu_1: 21.07
bleu_2: 11.92
bleu_3: 6.95
bleu_4: 4.13
METEOR: 7.90
ROUGE_L: 16.2
Do you know what went wrong?
Hi @menggehe, have you changed the path according to this issue? As far as I can remember, that is the only code modification I made.
@menggehe maybe you can use generator.py to generate result.txt and ref.txt (by default it only produces result.txt; you have to modify the code to also get ref.txt), then use result.txt and ref.txt as the parameters instead of result.txt and test.tsv.
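The evaluation call would then be along the lines of the following (result.txt and ref.txt being whatever paths you wrote the generated text and the references to):
python3 eval.py result.txt ref.txt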
Yes, I changed the path.
Thanks to @wutaiqiang's comments concerning the batch sizes, I was able to run the code with -title. Here are my BLEU scores for epoch 20:
Bleu_1: 18.633505234444097
Bleu_2: 10.556388209416772
Bleu_3: 6.175432745502417
Bleu_4: 3.676021095744585
So it got a little better but not enough.
@sysu-zjw, I did not modify anything except the path in generator.py, as was recommended in the other issue you linked. Then I used the commands as shown in my original post.
It is very mysterious why some of us can reproduce the results and some can't... @menggehe, thank you for posting your results, too! It is good to know I am not the only one struggling ^^ Unfortunately, I still do not know what goes wrong though...
What should I do to reproduce the results?
I can't reproduce the results either. Here are my results running the code with -title:
Bleu_1: 18.937064313624802
Bleu_2: 10.804192932667048
Bleu_3: 6.398713632926152
Bleu_4: 3.8613881960224457
METEOR: 7.894808327890248
ROUGE_L: 15.322604643734424
Hi @wutaiqiang , can you provide complete commands so that I can reproduce the same results as the paper? Thanks!
I got better results than yours, but I also can't reproduce the reported results:
Bleu_1: 32.937958268788414
Bleu_2: 22.447175820889868
Bleu_3: 16.162693118259686
Bleu_4: 11.788624343179919
METEOR: 17.268461361418215
ROUGE_L: 27.478546640828654
The results to reproduce are:
Those of you who were able to reproduce the results, could you please share the command line used and any changes made to the code? I have been trying to reproduce the results for a couple of weeks now and have tried every combination I could derive from the paper, but the results are nowhere close to the published ones.
There are some things you can do to reproduce the results:
1. First, you should use the title information, i.e. add -title to the training command line.
2. The eval.py command takes the generated text and the original abstracts, but data/preprocessed.test.tsv is not the original abstracts; it is the input data. So you should extract the original abstracts into the file you pass to eval.py (a sketch of this is given below).
3. The paper says it uses 'warm restarts' from 0.25 down to 0.05, so you can change the learning rate at each epoch, for example like this:
if o.param_groups[0]['lr'] > 0.05:
    o.param_groups[0]['lr'] -= 0.05
if e != 0 and e % 5 == 0:
    o.param_groups[0]['lr'] = 0.25
But even if you reproduce the results, I don't think they can be used in your projects, because the author did not provide the post-processing code. The generated text contains many repeated words, especially entities copied from the graph; there are far too many repeated entities in the output.
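Regarding point 2, a minimal sketch of building such a reference file could look like the following. Note that the field index is an assumption (I assume here that the abstract is the last tab-separated column of preprocessed.test.tsv), so check it against your copy of the file first:
# Write the reference abstracts into ref.txt so eval.py can compare the
# generated result.txt against them.
# ASSUMPTION: the abstract is the last tab-separated field of each line;
# verify this against data/preprocessed.test.tsv before relying on it.
with open("data/preprocessed.test.tsv", encoding="utf-8") as src, \
     open("ref.txt", "w", encoding="utf-8") as ref:
    for line in src:
        fields = line.rstrip("\n").split("\t")
        ref.write(fields[-1] + "\n")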
Thank you very much for your reply. I was able to get the desired results. I had already done everything except the second point you made, which is what helped me achieve them.
I see that the author reported some results that are far lower than those reported in this issue. Is that what you are comparing against? If so, the conversation doesn't add up, or am I missing something here?
~@ahhussein for BLEU, the lower the score, the better it is.~ Oops, I was thinking of perplexity. Disregard!