lack of checkpoint files?

Open zxgx opened this issue 5 years ago • 6 comments

Hi, I'm trying to reproduce the results by following the instructions in README.md. After a lengthy data preprocessing step, I hit the following exception:

========== INFERENCE =================  
Traceback (most recent call last):  
  File "get_last_checkpoint.py", line 44, in <module>  
    args.dir, 1, False, upper_bound=None,  
  File "get_last_checkpoint.py", line 30, in last_n_checkpoint_index  
    raise Exception('Found {} checkpoint files but need at least {}', len(entries), n)  
Exception: ('Found {} checkpoint files but need at least {}', 0, 1)  
GEN_DIR = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000  
GEN_OUT = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/infer.avg10.b5.lp1  
AVG_NUM = 10  
LAST_EPOCH =   
AVG_CHECKPOINT_OUT = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt  
---- Score by averaging last checkpoints 10 -> /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt  
Generating average checkpoints...  
Namespace(checkpoint_upper_bound=100000000, ema='False', ema_decay=1.0, inputs=['/home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default'], num_epoch_checkpoints=10, num_update_checkpoints=None, output='/home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt', user_dir='/home/zhg/tree_transformer')  
Traceback (most recent call last):  
  File "../scripts/average_checkpoints.py", line 186, in <module>  
    main()  
  File "../scripts/average_checkpoints.py", line 169, in main  
    args.inputs, num, is_update_based, upper_bound=args.checkpoint_upper_bound,  
  File "../scripts/average_checkpoints.py", line 117, in last_n_checkpoints  
    raise Exception('Found {} checkpoint files but need at least {}', len(entries), n)  
Exception: ('Found {} checkpoint files but need at least {}', 0, 10)

I suspect that some checkpoint files that should have been generated during training are missing. Could you please tell me how to fix this?

zxgx avatar May 04 '20 08:05 zxgx

I have run into the same issue, and I have not been able to solve it either. Did you manage to fix it?

liuqingpu avatar Jun 05 '20 08:06 liuqingpu

@liuqingpu no

zxgx avatar Jun 08 '20 11:06 zxgx

@liuqingpu no. It's difficult for me.

liuqingpu avatar Jun 15 '20 07:06 liuqingpu

I am having trouble reproducing the results as well. Did anyone get it to work?

Exception: ('Found {} checkpoint files but need at least {}', 0, 1)

The error suggests that the script attempts inference, but there are no checkpoint files in the experiment directory. Have you managed to get training working?
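One way to confirm this before running inference is to count the checkpoint files yourself. The snippet below is only a minimal sketch: the function names are made up for illustration, and it assumes epoch checkpoints are saved as checkpoint<N>.pt directly inside the training directory. It also formats the error message properly; the message in the traceback shows up as a tuple because the format string is never filled in.

```python
import re
from pathlib import Path

# Assumption: epoch checkpoints are named like "checkpoint12.pt".
EPOCH_CKPT = re.compile(r"checkpoint(\d+)\.pt")


def find_epoch_checkpoints(train_dir):
    """Return (epoch, path) pairs for epoch checkpoints in train_dir, newest first."""
    found = []
    for path in Path(train_dir).iterdir():
        match = EPOCH_CKPT.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found, reverse=True)


def require_checkpoints(train_dir, needed):
    """Return the newest `needed` checkpoint paths, or raise a readable error."""
    found = find_epoch_checkpoints(train_dir)
    if len(found) < needed:
        raise FileNotFoundError(
            "Found {} checkpoint files in {} but need at least {}".format(
                len(found), train_dir, needed
            )
        )
    return [path for _, path in found[:needed]]
```

If require_checkpoints(your_train_dir, 10) raises, training never wrote the checkpoints that the averaging step expects, so the training log is the place to look next.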

villmow avatar Aug 25 '20 11:08 villmow

I am also facing the same issue. The architecture dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier mentioned in README.md is not being registered as a fairseq architecture. Before the error reported by the OP, the code throws another error:


fairseq-train: error: argument --arch/-a: invalid choice: 'dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier' (choose from 'fconv_lm', 
'fconv_lm_dauphin_wikitext103', 'fconv_lm_dauphin_gbw', 'fconv', 'fconv_iwslt_de_en', 'fconv_wmt_en_ro', 'fconv_wmt_en_de', 
'fconv_wmt_en_fr', 'fconv_self_att', 'fconv_self_att_wp', 'lightconv_lm', 'lightconv_lm_gbw', 'lightconv', 'lightconv_iwslt_de_en', 
'lightconv_wmt_en_de', 'lightconv_wmt_en_de_big', 'lightconv_wmt_en_fr_big', 'lightconv_wmt_zh_en_big', 'lstm', 
'lstm_wiseman_iwslt_de_en', 'lstm_luong_wmt_en_de', 'transformer_lm', 'transformer_lm_big', 'transformer_lm_wiki103', 
'transformer_lm_gbw', 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de', 
'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big', 
'transformer_wmt_en_de_big_t2t', 'multilingual_transformer', 'multilingual_transformer_iwslt_de_en')

The script then goes on to run inference even though no training took place. With no valid checkpoint available, it throws the OP's error. Looking at https://github.com/nxphi47/tree_transformer/blob/master/src/models/nstack_archs.py#L615 may help. @nxphi47 Could you please help with this?
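Until the registration problem is fixed, it may help to make the pipeline stop as soon as training fails, so the real error is not buried under the checkpoint exception. Below is a rough sketch of that idea, not the repository's actual script: run_steps is a hypothetical helper, and the two commands are placeholders standing in for the real train and inference invocations. An argument error like the one above makes the process exit with a non-zero status, which is what the check relies on.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, argv) step in order and stop at the first failure.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed = []
    for name, argv in steps:
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(
                "step '{}' failed with exit code {}; stopping".format(
                    name, result.returncode
                ),
                file=sys.stderr,
            )
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Placeholder commands: the first imitates a training run that exits
    # with status 2, the second stands in for the inference step.
    steps = [
        ("train", [sys.executable, "-c", "import sys; sys.exit(2)"]),
        ("infer", [sys.executable, "-c", "print('inference')"]),
    ]
    done, failed = run_steps(steps)
    print("completed:", done, "failed:", failed)
```

With this structure the inference step is never started when training exits with an error, so the first message you see is the one that matters.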

Shikhar-S avatar Jul 06 '21 04:07 Shikhar-S

I have the same problem. Did anyone solve it?

JieYang1020 avatar Jan 10 '22 05:01 JieYang1020