attention-lvcsr

Some errors when installing kaldi-python

Open | Entonytang opened this issue 9 years ago · 12 comments

Ubuntu 14.04. I used the command ./setup.py install to set up kaldi-python, and I have already set $KALDI_ROOT. The errors are as follows:

/usr/include/python2.7/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 #warning "Using deprecated NumPy API, disable it by "
  ^
/usr/bin/ld: /home/jtang/Kaldi/kaldi-trunk/src/matrix/kaldi-matrix.a(kaldi-matrix.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
/home/jtang/Kaldi/kaldi-trunk/src/matrix/kaldi-matrix.a: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
make[1]: *** [kaldi_io_internal.so] Error 1
make[1]: Leaving directory `/home/jtang/Attention_ASR/kaldi-python/kaldi_io'
make: *** [all] Error 2

These errors seem to happen while creating kaldi_io_internal.so. If I don't link these .a files ($(KALDI_SRC)/matrix/kaldi-matrix.a, $(KALDI_SRC)/util/kaldi-util.a, $(KALDI_SRC)/base/kaldi-base.a), kaldi_io_internal.so can be created (of course, that file can't be used).
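The linker message itself points at the cause: the archives were compiled without -fPIC. A quick way to confirm this (a sketch, assuming binutils is available) is to look for absolute relocations in the archive:

readelf --relocs /home/jtang/Kaldi/kaldi-trunk/src/matrix/kaldi-matrix.a | grep R_X86_64_32
# Any R_X86_64_32 hits mean the objects are not position-independent
# and cannot be linked into a shared object like kaldi_io_internal.so.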

Entonytang avatar Nov 20 '15 00:11 Entonytang

As far as I remember, Kaldi has to be compiled differently for the kaldi-python installation to succeed. @dmitriy-serdyuk , @janchorowski , can you please comment on that?

rizar avatar Nov 20 '15 15:11 rizar

Can you tell me how you compile Kaldi? (In other words: how can I get files like kaldi-matrix.a?)


Entonytang avatar Nov 20 '15 15:11 Entonytang

Right, sorry that I didn't mention this. Kaldi should be compiled with the --shared flag:

./configure --shared --use-cuda=no # No need for cuda, we don't train models with kaldi
make
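In full, the rebuild sequence might look like this (a sketch; the kaldi-python path is illustrative, taken from the log above):

cd $KALDI_ROOT/src
./configure --shared --use-cuda=no   # --shared builds position-independent code, fixing the -fPIC link error
make
# then relink kaldi-python against the freshly built libraries:
cd /home/jtang/Attention_ASR/kaldi-python
./setup.py install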

dmitriy-serdyuk avatar Nov 20 '15 16:11 dmitriy-serdyuk

Could you please change the documentation? I guess it makes sense to do it in our private repository, since we are going to make what we have there the new master pretty soon.

rizar avatar Nov 20 '15 20:11 rizar

After changing the configure command, the problem is solved. The next step is:

$LVSR/bin/run.py train wsj_paper6 $LVSR/exp/wsj/configs/wsj_paper6.yaml

This default configuration trains the model on the CPU. How can I use the GPU instead?

Entonytang avatar Nov 23 '15 10:11 Entonytang

You can use the GPU in the same way as you usually do with Theano. Please read the Theano documentation.
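For example, with the Theano flags of that era (a sketch, assuming a single-GPU machine and float32 data):

THEANO_FLAGS=device=gpu,floatX=float32 $LVSR/bin/run.py train wsj_paper6 $LVSR/exp/wsj/configs/wsj_paper6.yaml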


rizar avatar Nov 23 '15 14:11 rizar

After adding "device =gpu3" while I find GPU Process in GPU 2(device K40).....using default wsj_paper6.yaml..... it costs 65 seconds per steps(1 epoch = 3700 steps), I think this speed is too slow for GPU...... so this speed is right or not , what should I do for speed up the training process and How much time one epoch?

Entonytang avatar Nov 24 '15 03:11 Entonytang

As I measured recently, one step was taking about 6 seconds on a Titan X; a K40 was a bit slower, about 8-9 seconds. So probably something is going wrong.

Make sure that Theano prints something like "Using gpu device 1: GeForce GTX TITAN X (CNMeM is enabled)". Another suggestion is to check that you use float32, not float64. I also use the optimizer_excluding=cudnn option, since I had some issues with cuDNN.

dmitriy-serdyuk avatar Nov 24 '15 15:11 dmitriy-serdyuk

Also use optimizer=fast_run in your THEANO_FLAGS
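Putting the suggestions from the last two comments together, the flags might look like this (a sketch; the device index and the printed device name will differ per machine):

export THEANO_FLAGS=device=gpu3,floatX=float32,optimizer=fast_run,optimizer_excluding=cudnn
$LVSR/bin/run.py train wsj_paper6 $LVSR/exp/wsj/configs/wsj_paper6.yaml
# on startup Theano should print a line like:
#   Using gpu device 3: Tesla K40c (CNMeM is enabled)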


rizar avatar Nov 24 '15 15:11 rizar

Thanks, solved. But at step 830 the program stopped without any warnings, while the GPU process was still there and the bokeh-server was also still running. Also, wsj_paper6.yaml doesn't seem to match the setting in the end-to-end attention-based LVCSR paper (250 Bi-GRU units in the paper, while wsj_paper6 has 320).

Epoch 0, step 829 | # | Elapsed Time: 2:09:35


Training status:
best_valid_per: 1
best_valid_sequence_log_likelihood: 503.460199693
epochs_done: 0
iterations_done: 829

Log records from the iteration 829:
gradient_norm_threshold: 239.912979126
max_attended_length: 400.0
max_attended_mask_length: 400.0
max_recording_length: 1600.0
sequence_log_likelihood: 189.054199219
time_read_data_this_batch: 0.0219719409943
time_read_data_total: 19.5282828808
time_train_this_batch: 11.5933840275
time_train_total: 7709.37198544
total_gradient_norm: 135.73147583
total_step_norm: 1.07967531681

Epoch 0, step 830 | # | Elapsed Time: 2:09:46

Entonytang avatar Nov 25 '15 02:11 Entonytang

Is there an exception or a core dump? Otherwise, something is wrong with your OS.

dmitriy-serdyuk avatar Nov 25 '15 18:11 dmitriy-serdyuk

I don't think so. I used another core and tried again, and the result is similar (best_valid_sequence_log_likelihood is 503.460199693, the same as the result after 830 steps). Also, only pretraining_model.zip, pretraining_log.zip, and pretraining.zip appear in the wsj_paper6 folder. Is wsj_paper6.yaml the right config?

Epoch 0, step 84 | #| Elapsed Time: 0:09:18


Training status:
best_valid_per: 1
best_valid_sequence_log_likelihood: 503.460199693
epochs_done: 0
iterations_done: 84

Log records from the iteration 84:
gradient_norm_threshold: 85.4330291748
max_attended_length: 248.0
max_attended_mask_length: 248.0
max_recording_length: 990.0
sequence_log_likelihood: 264.288513184
time_read_data_this_batch: 0.0211541652679
time_read_data_total: 2.17928504944
time_train_this_batch: 5.36292505264
time_train_total: 556.870803595
total_gradient_norm: 109.950737
total_step_norm: 0.572255551815

However, if I use wsj_paper4.yaml, the training process seems to have no problem.

Entonytang avatar Nov 26 '15 01:11 Entonytang