No model files

Open wzx95789 opened this issue 4 years ago • 6 comments

Hello, when I start training directly by running run_training.sh, it reports that there is no model.ckpt-50000 file. Can you help me solve this problem? Thanks!

wzx95789 avatar Sep 08 '20 06:09 wzx95789

The training first trains an IR model for 50000 steps. This should happen as the first step of run_training.sh. I guess that step failed for some reason. You may just need to rerun run_training.sh to see what happened in the first step.

chengxuz avatar Sep 08 '20 14:09 chengxuz
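
Before rerunning, it may help to confirm whether the first stage left the checkpoint on disk at all. The sketch below is only an illustration, not part of the repository: the helper name checkpoint_exists and the directory you pass in are hypothetical, and it assumes the usual TensorFlow checkpoint naming, where a checkpoint called model.ckpt-50000 is stored as files such as model.ckpt-50000.index and model.ckpt-50000.meta.

```python
import os


def checkpoint_exists(model_dir, step=50000, prefix="model.ckpt"):
    """Return True if a checkpoint for `step` appears to exist in `model_dir`.

    Hypothetical helper. Assumes TensorFlow-style naming, e.g.
    model.ckpt-50000.index / model.ckpt-50000.meta.
    """
    base = "{}-{}".format(prefix, step)
    if not os.path.isdir(model_dir):
        return False
    # Match the bare name or name + ".suffix" so that step 50000
    # is not confused with step 500000.
    return any(
        name == base or name.startswith(base + ".")
        for name in os.listdir(model_dir)
    )


if __name__ == "__main__":
    # "path/to/ir_model_dir" is a placeholder; use the directory that
    # your run_training.sh writes the IR model to.
    print(checkpoint_exists("path/to/ir_model_dir"))
```

If this prints False, the IR stage most likely did not finish, and its log output is the place to look.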

Thanks for your help. When I train the IR model, I run into a small problem: something goes wrong with image loading. Can you help me solve it? Thanks! WeChat image_20200917133218

wzx95789 avatar Sep 17 '20 05:09 wzx95789

From the error message, it seems that you are training the model on CPU, where some operations are not supported. Are you training on CPUs? I would not recommend that, as the model can take forever to train.

chengxuz avatar Sep 17 '20 13:09 chengxuz
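
A quick way to see whether the machine exposes any NVIDIA GPU at all is to ask nvidia-smi. The sketch below is an assumption-laden illustration, not part of the repository: list_nvidia_gpus is a hypothetical helper, it only works where the NVIDIA driver tools are installed, and a non-empty result does not by itself prove that the installed TensorFlow build can use the GPU.

```python
import shutil
import subprocess


def list_nvidia_gpus():
    """Return the GPU names reported by nvidia-smi, or [] if none can be queried.

    Hypothetical helper; relies on the nvidia-smi tool being on PATH.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # Driver missing, tool failed, or call timed out.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_nvidia_gpus()
    print("GPUs found:", gpus if gpus else "none")
```

An empty list suggests the training would fall back to CPU, which matches the kind of error described above.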

Thanks for your instructions. I can now use GPUs, but when training the model, the progress bar does not move for a very long time, as shown in the picture below. I do not know if I am missing some operation. Capture

wzx95789 avatar Sep 18 '20 13:09 wzx95789

Hello, when training the IR model for 50000 steps, the progress bar does not move. I found that the call at framework.py line 170, res = self.validation_params[val_key]['valid_loop']['func'](self.sess, self.all_val_targets[val_key]), cannot be executed. I do not know if I am missing some operation or doing something wrong. Can you give me some advice?

wzx95789 avatar Sep 25 '20 05:09 wzx95789

Sorry for the late reply. I don't quite know why this is happening, especially as other people are able to run validation and training without any problem. Based on your log, your training runs for at least one step. One possibility is that you do not have enough computational resources for the validation data loading, for example not enough CPUs. You can try lowering the val_num_workers parameter to something like 5 by adding --val_num_workers 5 to the training commands.

chengxuz avatar Sep 25 '20 12:09 chengxuz
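
To pick a value for --val_num_workers that fits the machine, one rough rule is to cap the requested worker count by the number of available CPU cores. The sketch below illustrates that rule only; suggest_num_workers is a hypothetical helper, and keeping one core free for the main training process is an assumption, not something the repository prescribes.

```python
import os


def suggest_num_workers(requested, cpu_count=None):
    """Cap `requested` workers so at most (cores - 1) are used, minimum 1.

    Hypothetical helper. `cpu_count` can be passed explicitly; otherwise
    os.cpu_count() is used, falling back to 1 if it is unknown.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # Leave one core free for the main process (an assumption).
    usable = max(1, cpu_count - 1)
    return max(1, min(requested, usable))


if __name__ == "__main__":
    workers = suggest_num_workers(5)
    print("--val_num_workers {}".format(workers))
```

The printed flag can then be appended to the training command, as suggested in the reply above.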