tree_transformer
lack of checkpoint files?
Hi, I'm trying to reproduce the results by following the instructions in README.md. After spending a long time preprocessing the data, I ran into the following exception:
========== INFERENCE =================
Traceback (most recent call last):
File "get_last_checkpoint.py", line 44, in <module>
args.dir, 1, False, upper_bound=None,
File "get_last_checkpoint.py", line 30, in last_n_checkpoint_index
raise Exception('Found {} checkpoint files but need at least {}', len(entries), n)
Exception: ('Found {} checkpoint files but need at least {}', 0, 1)
GEN_DIR = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000
GEN_OUT = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/infer.avg10.b5.lp1
AVG_NUM = 10
LAST_EPOCH =
AVG_CHECKPOINT_OUT = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt
---- Score by averaging last checkpoints 10 -> /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt
Generating average checkpoints...
Namespace(checkpoint_upper_bound=100000000, ema='False', ema_decay=1.0, inputs=['/home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default'], num_epoch_checkpoints=10, num_update_checkpoints=None, output='/home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt', user_dir='/home/zhg/tree_transformer')
Traceback (most recent call last):
File "../scripts/average_checkpoints.py", line 186, in <module>
main()
File "../scripts/average_checkpoints.py", line 169, in main
args.inputs, num, is_update_based, upper_bound=args.checkpoint_upper_bound,
File "../scripts/average_checkpoints.py", line 117, in last_n_checkpoints
raise Exception('Found {} checkpoint files but need at least {}', len(entries), n)
Exception: ('Found {} checkpoint files but need at least {}', 0, 10)
I suppose some checkpoint files that should have been generated during training are missing. Could you please tell me how I can work this out?
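For anyone debugging this, you can reproduce what the averaging script is looking for by running its checkpoint glob over your save directory yourself. The sketch below mirrors the logic of fairseq's scripts/average_checkpoints.py (a simplified re-implementation, not the repo's exact code; the filename patterns are assumptions based on fairseq's default checkpoint naming):

```python
import os
import re

def last_n_checkpoints(path, n, update_based=False):
    """Collect the last n epoch- or update-based checkpoints under `path`,
    mimicking fairseq's scripts/average_checkpoints.py."""
    pt_regexp = re.compile(
        r"checkpoint_\d+_(\d+)\.pt" if update_based else r"checkpoint(\d+)\.pt"
    )
    entries = []
    for fname in os.listdir(path):
        m = pt_regexp.fullmatch(fname)
        if m is not None:
            entries.append((int(m.group(1)), os.path.join(path, fname)))
    if len(entries) < n:
        # This is the condition the thread's exception comes from.
        raise Exception(f"Found {len(entries)} checkpoint files but need at least {n}")
    # Newest first, keep the n most recent.
    return [f for _, f in sorted(entries, reverse=True)[:n]]
```

If this raises on your experiment directory, the problem is upstream: training never saved any `checkpoint*.pt` files, so the averaging and inference steps have nothing to work with.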
I have met the same issue. I haven't been able to solve it either. Have you figured it out?
@liuqingpu no
It's difficult for me.
I have trouble reproducing the results as well, did anyone make it work?
Exception: ('Found {} checkpoint files but need at least {}', 0, 1)
The error suggests that the script is attempting inference, but there are no checkpoint files under the experiment directory. Have you managed to get training working?
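A quick way to confirm this before the inference step runs is to count the checkpoints in the experiment directory yourself (the directory here is a temporary stand-in; point it at your actual save directory):

```shell
# Check that training actually wrote checkpoints; the averaging step
# in the thread needs at least AVG_NUM (10) of them.
EXP_DIR=$(mktemp -d)   # substitute your real experiment directory

count=$(ls "$EXP_DIR"/checkpoint*.pt 2>/dev/null | wc -l)
echo "checkpoints found: $count"   # 0 means averaging will fail as above
```

If the count is 0, training either never ran or crashed before the first save, and the averaging/inference errors are just the downstream symptom.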
I am also facing the same issue. The architecture dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier mentioned in README.md is not getting registered as a fairseq architecture. Before the error mentioned by OP, the code throws another error:
fairseq-train: error: argument --arch/-a: invalid choice: 'dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier' (choose from 'fconv_lm',
'fconv_lm_dauphin_wikitext103', 'fconv_lm_dauphin_gbw', 'fconv', 'fconv_iwslt_de_en', 'fconv_wmt_en_ro', 'fconv_wmt_en_de',
'fconv_wmt_en_fr', 'fconv_self_att', 'fconv_self_att_wp', 'lightconv_lm', 'lightconv_lm_gbw', 'lightconv', 'lightconv_iwslt_de_en',
'lightconv_wmt_en_de', 'lightconv_wmt_en_de_big', 'lightconv_wmt_en_fr_big', 'lightconv_wmt_zh_en_big', 'lstm',
'lstm_wiseman_iwslt_de_en', 'lstm_luong_wmt_en_de', 'transformer_lm', 'transformer_lm_big', 'transformer_lm_wiki103',
'transformer_lm_gbw', 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de',
'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big',
'transformer_wmt_en_de_big_t2t', 'multilingual_transformer', 'multilingual_transformer_iwslt_de_en')
The script then continues on to inference without any training having happened, and in the absence of a valid checkpoint it throws OP's error. Looking at https://github.com/nxphi47/tree_transformer/blob/master/src/models/nstack_archs.py#L615 will probably help. @nxphi47 Could you please help with this?
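The "invalid choice" list above suggests the custom module was never imported, so its registration decorators never ran. Fairseq's architecture registry works roughly like the sketch below (a simplified stand-in, not fairseq's actual code): an architecture name only becomes a valid `--arch` choice after the module defining it is imported, which is what `--user-dir` is supposed to trigger. If `--user-dir` points at the wrong directory, or the fairseq version doesn't pick it up, you get exactly this error.

```python
# Simplified model of fairseq's register_model_architecture mechanism.
ARCH_REGISTRY = {}

def register_model_architecture(arch_name):
    """Decorator that records an architecture name at import time.
    Importing the module that uses it is what makes the name a
    valid --arch choice; if the import never happens, it is missing."""
    def wrapper(fn):
        ARCH_REGISTRY[arch_name] = fn
        return fn
    return wrapper

# In the real repo this decorator lives in src/models/nstack_archs.py;
# if --user-dir never causes that file to be imported, this never runs.
@register_model_architecture(
    "dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier"
)
def iwslt_arch(args):
    pass

print(
    "dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier"
    in ARCH_REGISTRY
)
```

So a first thing to check is that `--user-dir` resolves to the directory containing the repo's `src` modules and that importing them by hand (e.g. in a Python shell) succeeds without errors.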
I have the same problem, did anyone solve it?