
I ran eesen on a GridEngine cluster, but only feature extraction and decoding ran on the cluster

Open zhangjiulong opened this issue 8 years ago • 7 comments

I see in the train_ctc_parallel.sh file that the training command is train-ctc-parallel. Does this mean eesen does not support training on an SGE cluster?

zhangjiulong avatar Aug 04 '16 08:08 zhangjiulong

You can submit the run of train_ctc_parallel.sh to the scheduler (for example, https://github.com/srvk/eesen/blob/master/asr_egs/wsj/run_ctc_phn.sh#L75). Alternatively, you can modify train_ctc_parallel.sh by following https://github.com/srvk/eesen/blob/master/asr_egs/wsj/steps/train_ctc_parallel_h.sh#L141

yajiemiao avatar Aug 04 '16 15:08 yajiemiao
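For illustration, the first option (submitting the whole script to the scheduler) might look like the sketch below. The qsub flags, the "gpu" resource name, and the data/exp paths are assumptions about a typical SGE site, not something taken from the EESEN recipes; the helper function is hypothetical:

```shell
# Hypothetical helper: build the qsub command that submits one training run
# to SGE. -cwd keeps the recipe's relative paths working; "-l gpu=1" assumes
# the cluster exposes a "gpu" consumable resource, which varies by site.
build_submit_cmd() {
  jobname=$1
  echo "qsub -cwd -l gpu=1 -N $jobname -o exp/train_ctc/qsub.log steps/train_ctc_parallel.sh data/train data/dev exp/train_ctc"
}

# Print the command instead of running it, so it can be inspected first.
build_submit_cmd train_ctc
```

Kaldi-style recipes often wrap this pattern in a cmd.sh variable (e.g. queue.pl) instead of calling qsub directly; either way, the point is that the scheduler sees one job per invocation of the script.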

I followed https://github.com/srvk/eesen/blob/master/asr_egs/wsj/steps/train_ctc_parallel_h.sh#L141 and set nj to the number of my GPUs (I have one executor node with 3 GPUs, so nj is set to 3). There are 3 training processes on the executor node, but all of them use the same GPU. I ran nvidia-smi; the result is as follows:

| NVIDIA-SMI 361.42     Driver Version: 361.42         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 0000:04:00.0     Off |                  N/A |
|  5%   61C    P8    21W / 170W |     15MiB /  4094MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 970     Off  | 0000:83:00.0     Off |                  N/A |
| 45%   61C    P8    15W / 170W |     15MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 970     Off  | 0000:84:00.0     Off |                  N/A |
| 20%   68C    P2    69W / 170W |   1026MiB /  4095MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2     30469    C   train-ctc-parallel                             335MiB |
|    2     30470    C   train-ctc-parallel                             336MiB |
|    2     30473    C   train-ctc-parallel                             335MiB |
+-----------------------------------------------------------------------------+

zhangjiulong avatar Aug 05 '16 03:08 zhangjiulong

You should set it to the number of jobs (in your case, just 1), not the number of GPUs. When you set it to 3, the script submits the same job three times.

yajiemiao avatar Aug 05 '16 03:08 yajiemiao
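For reference (this is general CUDA behavior, not something the thread confirms EESEN does): if one did want several independent single-GPU jobs on one multi-GPU node, each would normally be pinned to its own device with CUDA_VISIBLE_DEVICES; otherwise all of them tend to pick the same GPU, as in the nvidia-smi output above. A hedged sketch, with the binary's arguments elided and the helper name hypothetical:

```shell
# Hypothetical helper: print one launch command per GPU, each pinned to a
# distinct device via CUDA_VISIBLE_DEVICES. Inside each process, CUDA then
# sees only that one physical GPU (as device 0).
launch_cmds() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "CUDA_VISIBLE_DEVICES=$i train-ctc-parallel ..."
    i=$((i + 1))
  done
}

launch_cmds 3
```

The commands are only printed here so the mapping is visible; in practice each would be run in the background with its own log file.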

Does this mean the training process can only run on one node and use only one GPU?

zhangjiulong avatar Aug 05 '16 03:08 zhangjiulong

+1 on @zhangjiulong's last question.

Is it still the case that Eesen doesn't support multi-GPU training? If not, what is the best way to enable it?

iurii-milovanov avatar Aug 29 '16 03:08 iurii-milovanov

We do have a multi-GPU implementation. Would some of you be available to help test it? We’ll need help in determining the best parameterization (when to average models, how many GPUs, …) unless it can be considered “stable”.


Florian Metze (http://www.cs.cmu.edu/directory/florian-metze), Associate Research Professor, Carnegie Mellon University

fmetze avatar Aug 29 '16 13:08 fmetze

EESEN's current multi-GPU implementation is the script steps/train_ctc_parallel_h.sh, which is based on naive model averaging. It is not stable yet. Several people are working on this from different angles, but nothing concrete has been checked into the repo yet.

yajiemiao avatar Aug 30 '16 06:08 yajiemiao
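To make "naive model averaging" concrete: each worker trains its own copy of the model on a different data shard, and the copies are periodically combined by taking the element-wise mean of corresponding parameters. A toy illustration of just the arithmetic (EESEN does this over large weight matrices in C++; the function name and two-worker setup here are illustrative):

```shell
# Toy naive model averaging: element-wise mean of the same parameter vector
# as held by two workers. $1 and $2 are space-separated vectors of equal
# length; awk sees them concatenated, so $i and $(i+n) are matching entries.
average() {
  echo "$1" "$2" | awk '{
    n = NF / 2
    for (i = 1; i <= n; i++)
      printf "%g%s", ($i + $(i+n)) / 2, (i < n ? " " : "\n")
  }'
}

average "2 4" "4 8"   # prints: 3 6
```

With more than two workers the same idea applies (sum all copies, divide by the worker count); the open tuning questions mentioned above are how often to average and how many workers the averaging tolerates before convergence degrades.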