THUMT: About the time needed to train a model

How long does it take to train a model on a single 32 GB GPU with the default parameters? Training feels very slow: on my GPU, one step takes 3.7 seconds. Is that normal?

Aug 26 '20 05:08 Rooders

@Rooders Please check whether update_cycle is set to 1. If it is, then I think the training speed is abnormal. With the default parameters (model=Transformer, update_cycle=1, device_list=[0], batch_size=4096), each training step usually takes less than 1 second. The most likely reason is that your training program is running on the CPU rather than the GPU. Please make sure device_list is set to the index of the GPU you are going to use.

Aug 26 '20 07:08 GrittyChen

Sorry, by "default parameters" I meant the recommended settings in UserManual.pdf: update_cycle=4, batch_size=6250. I have now followed your advice and set update_cycle=1, batch_size=4096, device_list=[0], but it is still slow: each training step takes about 2.6 seconds. For this run, my GPU is a single Tesla P40 22G. I have checked the device index and the GPU is available. If training is nevertheless not using the GPU, could my TensorFlow version be wrong? My version is tensorflow-gpu=1.15.
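To compare the step times reported so far on an equal footing, they can be converted into an approximate throughput. The sketch below is only illustrative and rests on two assumptions that the thread does not confirm: that batch_size counts tokens per device, and that one reported step covers update_cycle accumulated batches. The helper name `tokens_per_second` is hypothetical and not part of THUMT.

```python
def tokens_per_second(batch_size, update_cycle, num_devices, seconds_per_step):
    """Approximate tokens processed per second.

    Assumes batch_size is measured in tokens per device and that one
    step accumulates update_cycle batches (both are assumptions).
    """
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    tokens_per_step = batch_size * update_cycle * num_devices
    return tokens_per_step / seconds_per_step


# (label, batch_size, update_cycle, num_devices, seconds_per_step)
runs = [
    ("UserManual.pdf settings, 3.7 s/step", 6250, 4, 1, 3.7),
    ("default settings, 2.6 s/step", 4096, 1, 1, 2.6),
    ("reference, at most 1 s/step", 4096, 1, 1, 1.0),
]

for label, batch_size, update_cycle, num_devices, seconds in runs:
    rate = tokens_per_second(batch_size, update_cycle, num_devices, seconds)
    print("%-38s %8.0f tokens/s" % (label, rate))
```

Because the reference figure is "less than 1 second" per step, the last row is a lower bound on the expected throughput rather than an exact value.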

Aug 26 '20 07:08 Rooders

THUMT-TensorFlow can run with tensorflow-gpu=1.15. You can run a simple TensorFlow GPU program (for example, a matrix multiplication) to check whether it can use the GPU. If it cannot, check the CUDA version and the driver version to make sure they match.
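A minimal sketch of such a check is shown below. It is not taken from THUMT; the function name `check_gpu_matmul` is hypothetical. It pins a matrix multiplication to `/GPU:0` and disables soft placement, so TensorFlow reports an error instead of silently falling back to the CPU. It uses the `tf.compat.v1` session API, which is available in TensorFlow 1.15.

```python
def check_gpu_matmul(size=1024):
    """Try to run a matrix multiplication on /GPU:0.

    Returns (ok, message). ok is False if TensorFlow cannot be imported
    or if the operation cannot be placed on a GPU.
    """
    try:
        import tensorflow as tf
    except Exception as exc:  # broad on purpose: this is a diagnostic script
        return False, "TensorFlow could not be imported: %s" % exc

    graph = tf.Graph()
    with graph.as_default():
        with tf.device("/GPU:0"):
            a = tf.random.normal([size, size])
            b = tf.random.normal([size, size])
            product = tf.matmul(a, b)

    # allow_soft_placement=False makes placement fail loudly when no GPU
    # is usable; log_device_placement=True logs where each op runs.
    config = tf.compat.v1.ConfigProto(
        allow_soft_placement=False, log_device_placement=True)
    try:
        with tf.compat.v1.Session(graph=graph, config=config) as sess:
            sess.run(product)
    except tf.errors.OpError as exc:
        return False, "matmul could not run on /GPU:0: %s" % exc.message
    return True, "matmul ran on /GPU:0"


if __name__ == "__main__":
    ok, message = check_gpu_matmul()
    print("GPU usable:", ok)
    print(message)
```

If the check fails, the TensorFlow startup log is worth reading as well, since it usually names any CUDA library that could not be loaded.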

Aug 26 '20 08:08 GrittyChen

Thank you very much, the issue has been solved. The cause was that the CUDA version did not match the TensorFlow version. By the way, if I set update_cycle=1, batch_size=4096 and train a zh-en model on 200 million sentence pairs, what BLEU score can I expect on the validation corpus?

Aug 26 '20 09:08 Rooders

@Rooders Sorry, we did not record the BLEU scores under this setting.

Aug 28 '20 02:08 GrittyChen
