eesen
I ran eesen on a GridEngine cluster, but only feature extraction and decoding ran on the cluster.
I see in the train_ctc_parallel.sh file that the training command is train-ctc-parallel. Does this mean eesen does not support training on an SGE cluster?
You can submit the running of train_ctc_parallel.sh (see for example https://github.com/srvk/eesen/blob/master/asr_egs/wsj/run_ctc_phn.sh#L75) to the scheduler. Alternatively, you can modify train_ctc_parallel.sh itself by following https://github.com/srvk/eesen/blob/master/asr_egs/wsj/steps/train_ctc_parallel_h.sh#L141.
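For example, a minimal sketch of the first option, submitting the whole training step as one SGE GPU job (the queue name, resource request, script options, and experiment paths here are assumptions and have to be adapted to your site and recipe):

```bash
# Submit steps/train_ctc_parallel.sh to SGE instead of running it locally.
# "gpu.q" / "-l gpu=1" are placeholders for your cluster's GPU queue and
# resource request; take the script options and directories from your run
# script (e.g. run_ctc_phn.sh).
qsub -q gpu.q -l gpu=1 -cwd -j y -o exp/train_phn_l4_c320/qsub.log \
  steps/train_ctc_parallel.sh --add-deltas true \
  data/train_tr95 data/train_cv05 exp/train_phn_l4_c320
```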
I followed https://github.com/srvk/eesen/blob/master/asr_egs/wsj/steps/train_ctc_parallel_h.sh#L141 and set nj to the number of my GPUs (my executor node has 3 GPUs, so nj is set to 3). There are 3 training processes on the executor node, but all of them use the same GPU. I ran nvidia-smi; the result is as follows:
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 970 Off | 0000:04:00.0 Off | N/A |
| 5% 61C P8 21W / 170W | 15MiB / 4094MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 970 Off | 0000:83:00.0 Off | N/A |
| 45% 61C P8 15W / 170W | 15MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 970 Off | 0000:84:00.0 Off | N/A |
| 20% 68C P2 69W / 170W | 1026MiB / 4095MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 30469 C train-ctc-parallel 335MiB |
| 2 30470 C train-ctc-parallel 336MiB |
| 2 30473 C train-ctc-parallel 335MiB |
+-----------------------------------------------------------------------------+
You should set it to the number of jobs (in your case just 1), instead of the number of GPUs. When you set it to 3, the script submits the same job three times.
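To illustrate why that happens, here is a rough, simplified sketch (placeholder commands, not the actual script contents): each of the nj background jobs runs an identical training command, and nothing selects a different GPU per job, so CUDA's default device choice puts them all on the same card.

```bash
#!/bin/bash
# Illustration only; the real launch logic is in steps/train_ctc_parallel_h.sh.
nj=3
mkdir -p log
for n in $(seq "$nj"); do
  # Every job runs the same trainer. Pinning job $n to its own card would
  # need something like CUDA_VISIBLE_DEVICES=$((n-1)) in front of the real
  # train-ctc-parallel command; without it, all jobs share one GPU.
  echo "job $n: train-ctc-parallel <same options for every job>" \
    > "log/tr.job${n}.log" 2>&1 &
done
wait
```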
Does this mean the training process can only run on one node and use only one GPU?
+1 on @zhangjiulong's last question.
Is it still the case that Eesen doesn't support multi-GPU training? If not, what is the best way to enable this option?
We do have a multi-GPU implementation. Would some of you be available to help test it? We’ll need help determining the best parameterization (when to average models, how many GPUs, …) before it can be considered “stable”.
Florian Metze (http://www.cs.cmu.edu/directory/florian-metze), Associate Research Professor, Carnegie Mellon University
EESEN's current multi-GPU implementation is the script steps/train_ctc_parallel_h.sh, which is based on naive model averaging. It is not stable yet. Several people are working on this from different angles, but there is nothing concrete to check into the repo yet.
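For context, naive model averaging works roughly like this (a conceptual sketch only, not the script's actual code; `average-models` is a hypothetical placeholder, and the real options and argument order of train-ctc-parallel differ): each GPU trains its own copy of the model on a shard of the data for one iteration, and the copies are then averaged into the single model that starts the next iteration.

```bash
#!/bin/bash
# Conceptual sketch of naive model averaging across GPUs; the actual
# implementation lives in steps/train_ctc_parallel_h.sh.
nj=3     # one training job per GPU
iter=1
for n in $(seq "$nj"); do
  # Each job sees only one GPU and trains on its own feature shard,
  # starting from the shared model of the previous iteration.
  # (Options/arguments are indicative only.)
  CUDA_VISIBLE_DEVICES=$((n - 1)) \
    train-ctc-parallel "scp:feats.tr.${n}.scp" "ark:labels.tr.ark" \
      "nnet.iter$((iter - 1))" "nnet.iter${iter}.${n}" &
done
wait
# Average the per-GPU models into the model used by the next iteration.
# "average-models" is a placeholder for whatever averaging step is used.
average-models nnet.iter${iter}.* nnet.iter${iter}
```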