
Decoding conformer_ctc trained on TIMIT with ctc-decoding

P1nkman1 opened this issue 10 months ago • 22 comments

Hi, I'm trying to train conformer_ctc on TIMIT. Training seems to work, but during decoding with the ctc-decoding method, the whole process hangs in the get_lattice() call inside decode_one_batch() and I have to kill it manually. Do you have any ideas on what may be causing this?

P1nkman1 avatar Apr 04 '24 19:04 P1nkman1

Hi,

Would you attach some logs so we can locate the issue?

best jin


JinZr avatar Apr 07 '24 03:04 JinZr

FYI, in my experience, if your AED conformer model has not converged well, the decoding process can be quite slow and resource-consuming; training for more epochs might help.

best jin


JinZr avatar Apr 12 '24 01:04 JinZr

Hi, I am getting a similar issue.

I have trained a zipformer model with --use-transducer false and --use-ctc true. During decoding, it gets stuck at the get_lattice stage in zipformer/ctc_decode.py.

Below are the losses at the end of the 10th epoch, which I am using for decoding:

      Training: loss[loss=0.03422, ctc_loss=0.03422, over 12273.00 frames. ], 
                       tot_loss[loss=0.06499, ctc_loss=0.06499, over 2670542.70 frames. ]
      Validation: loss=0.02084, ctc_loss=0.02084

Below are the library versions:
torch: 2.2.1
k2: 1.24.4
icefall: pulled on 18 Mar 2024

Thanks!

divyeshrajpura4114 avatar Apr 12 '24 05:04 divyeshrajpura4114

Please attach some stack information after manually terminating the program, so that we can locate the issue. Thanks!

Best Regards,
JIN

JinZr avatar Apr 12 '24 05:04 JinZr

Please find the attached screenshot for your reference

Screenshot from 2024-04-12 11-24-01

divyeshrajpura4114 avatar Apr 12 '24 06:04 divyeshrajpura4114

Thanks for the screenshot.

However, the screenshot does not show that it gets stuck at get_lattice.

Could you tell us how you found that it gets stuck at get_lattice?

csukuangfj avatar Apr 12 '24 07:04 csukuangfj

I had added logging at various stages to figure out which step was causing the issue.

divyeshrajpura4114 avatar Apr 12 '24 07:04 divyeshrajpura4114

Are you able to get the stack trace after it gets stuck in get_lattice by manually pressing ctrl + c?

Also, what is the value of --max-duration and how large is your GPU RAM? Does it work when you reduce --max-duration?
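
If Ctrl+C gives you nothing useful, one option (just a sketch, not part of the recipe) is to register a handler from Python's standard faulthandler module near the top of zipformer/ctc_decode.py and then signal the hung process from another terminal:

    import faulthandler
    import signal

    # Dump the Python-level traceback of every thread when the process
    # receives SIGUSR1, e.g. run `kill -USR1 <pid>` from another shell.
    # Even if the hang is inside k2's C++ code, this shows at least the
    # Python call (e.g. get_lattice) that never returned.
    faulthandler.register(signal.SIGUSR1, all_threads=True)

Alternatively, py-spy dump --pid <pid> gives similar information without modifying the script.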

csukuangfj avatar Apr 12 '24 07:04 csukuangfj

Are you able to get the stack trace after it gets stuck in get_lattice by manually pressing ctrl + c?

No, actually it does not even stop with Ctrl+C; I have to kill the process with Ctrl+Z.

what is the value of --max-duration

500s

how large is your GPU RAM

It's 48 GB, and about 10 GB is occupied during decoding.

Let me try to debug further and see if I can get a more detailed stack trace.

Thanks, Divyesh Rajpura

divyeshrajpura4114 avatar Apr 12 '24 08:04 divyeshrajpura4114

@csukuangfj, I have changed the device from GPU to CPU.

With this change, the decoding terminated automatically, and below is the stack trace:

[F] /var/www/k2/csrc/array.h:176:k2::Array1<T> k2::Array1<T>::Arange(int32_t, int32_t) const [with T = char; int32_t = int] Check failed: start >= 0 (-152562455 vs. 0)

[ Stack-Trace: ]
/opt/conda/lib/python3.10/site-packages/k2/lib64/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7fbc188e49b4]
/opt/conda/lib/python3.10/site-packages/k2/lib64/libk2context.so(k2::Array1<char>::Arange(int, int) const+0x69d) [0x7fbc18e7fded]
/opt/conda/lib/python3.10/site-packages/k2/lib64/libk2context.so(k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int)+0x6a6) [0x7fbc1904a236]
/opt/conda/lib/python3.10/site-packages/k2/lib64/libk2context.so(std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect(k2::DenseFsaVec*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)+0x1e7) [0x7fbc1904d067]
/opt/conda/lib/python3.10/site-packages/k2/lib64/libk2context.so(k2::ThreadPool::ProcessTasks()+0x163) [0x7fbc191ec283]
/opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xdbbf4) [0x7fbccb0c7bf4]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fbcd3713ac3]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7fbcd37a4a04]

Aborted (core dumped)

Thanks, Divyesh Rajpura

divyeshrajpura4114 avatar Apr 12 '24 12:04 divyeshrajpura4114

@csukuangfj, @JinZr Just wanted to check if you had a chance to go through the above issue.

divyeshrajpura4114 avatar Apr 15 '24 13:04 divyeshrajpura4114

Below are the library versions
torch: 2.2.1
k2: 1.24.4

Could you tell us the exact k2 version you are using?

csukuangfj avatar Apr 16 '24 01:04 csukuangfj

Screenshot 2024-04-16 at 09 31 50

csukuangfj avatar Apr 16 '24 01:04 csukuangfj

@csukuangfj, thanks for the response.

The exact k2 version is 1.24.4.dev20240223+cuda12.1.torch2.2.1

divyeshrajpura4114 avatar Apr 16 '24 03:04 divyeshrajpura4114

It looks to me like the lattice is so large that the index cannot be represented using an int32_t, so it overflows.

Could you reduce --max-duration? In the extreme case, you can set --max-duration=1 and increase it incrementally.
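
You can also keep the lattice smaller by tightening the pruning parameters passed to get_lattice. Below is only a sketch: the variable names mirror decode_one_batch in the librispeech ctc_decode.py, and the values mentioned in the comments are illustrative; check get_params() in your copy for the actual defaults.

    from icefall.decode import get_lattice

    # Sketch: smaller beams / fewer active states bound the lattice size
    # per frame (the values here are illustrative, not tuned).
    lattice = get_lattice(
        nnet_output=ctc_output,
        decoding_graph=H,  # the CTC topology used by ctc-decoding
        supervision_segments=supervision_segments,
        search_beam=15,          # recipe default is around 20
        output_beam=6,           # recipe default is around 8
        min_active_states=30,
        max_active_states=5000,  # caps the states kept per frame
        subsampling_factor=params.subsampling_factor,
    )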

csukuangfj avatar Apr 16 '24 04:04 csukuangfj

I have already tried reducing it to 100 s; let me reduce it further.

divyeshrajpura4114 avatar Apr 16 '24 04:04 divyeshrajpura4114

It is also good to know the exact wave that is causing this error. You can check whether the wave is very long.
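
One way to find it (just a sketch; it assumes the batch carries the lhotse cuts under supervisions["cut"], as the librispeech recipes do) is to log the cut IDs and durations at the top of decode_one_batch, so the last batch printed before the abort contains the offending wave:

    # logging is already imported in the decoding script.
    cuts = batch["supervisions"]["cut"]
    for cut in cuts:
        logging.info(f"cut id: {cut.id}, duration: {cut.duration:.2f} s")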

csukuangfj avatar Apr 16 '24 04:04 csukuangfj

The audio durations in the test data are in the range of 2 to 15 s.

Reducing the max duration helped. The maximum --max-duration value that worked for me is 15. If I increase it further, decoding aborts with Aborted (core dumped) or with the error I posted above.

Current GPU memory utilisation is ~47 GB. Isn't this too high?

Thanks for your time and effort.

Thanks, Divyesh Rajpura

divyeshrajpura4114 avatar Apr 16 '24 05:04 divyeshrajpura4114

Current GPU memory utilisation is ~47 GB. Isn't this too high?

The maximum --max-duration value that worked for me is 15

Could you tell us which decoding method you are using?

It would be great if you could share the exact command you are using.

csukuangfj avatar Apr 16 '24 06:04 csukuangfj

Sure. The decoding method is ctc-decoding, and below is the command I am using:

    export CUDA_VISIBLE_DEVICES=0
    python3 zipformer/ctc_decode.py \
        --epoch 10 \
        --avg 1 \
        --exp-dir exp/dnn/zipformer_ctc \
        --use-transducer 0 \
        --use-ctc 1 \
        --max-duration 15 \
        --causal 0 \
        --decoding-method ctc-decoding \
        --manifest-dir exp/fbank \
        --lang-dir exp/lang/bpe_5000/ \
        --bpe-model exp/lang/bpe_5000/bpe.model

Thanks, Divyesh Rajpura

divyeshrajpura4114 avatar Apr 16 '24 07:04 divyeshrajpura4114

--lang-dir exp/lang/bpe_5000/

I see the issue. Your vocab size is 5000, which is much larger than our 500.

Please change https://github.com/k2-fsa/icefall/blob/ed6bc200e37aaea0129ae32095642c096d4ffad5/egs/librispeech/ASR/zipformer/ctc_decode.py#L662 to

 modified=True, 

and you should be able to use a larger --max-duration.
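
For context, the line being changed is the construction of the CTC topology H. Assuming your copy matches the current librispeech recipe, the call looks roughly like this after the change:

    # H is the CTC topology used as the decoding graph for ctc-decoding.
    # modified=True builds the "modified" topology, which has far fewer
    # arcs when the vocabulary is large.
    H = k2.ctc_topo(
        max_token=max_token_id,
        modified=True,
        device=device,
    )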


I suggest that you also try a smaller vocab size.

A larger vocab size does not necessarily imply better performance.

csukuangfj avatar Apr 17 '24 01:04 csukuangfj

Thanks for your suggestion @csukuangfj.

A larger vocab size does not necessarily imply better performance.

I will also give it a try with a reduced vocab size.

Thanks, Divyesh Rajpura

divyeshrajpura4114 avatar Apr 17 '24 05:04 divyeshrajpura4114

After changing modified=True, I am able to run decoding on GPU with a larger --max-duration.

By reducing the vocab size to 500, the GPU memory usage drops to ~10 GB. However, I observed a relative degradation of ~10% (when decoding with an external LM) compared to the vocab size of 5000. When I revert to modified=False, I can still run decoding on GPU with a large --max-duration, but the memory consumption again reaches ~38 GB.

If you can provide some details or a reference to understand the underlying concept behind modified, that would be really helpful.

Thanks for your time, effort and suggestions.

Thanks, Divyesh Rajpura

divyeshrajpura4114 avatar Apr 30 '24 05:04 divyeshrajpura4114

If you can provide some details or a reference to understand the underlying concept behind modified, that would be really helpful.

@divyeshrajpura4114 Could you refer to the k2 documentation for that?
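
Roughly speaking (only a sketch; see the k2 documentation for the exact definitions): the standard CTC topology has a number of arcs that grows quadratically with the vocabulary size, while the modified topology grows only linearly, which is why the flag matters with a 5000-token vocab. You can see the difference directly:

    import k2

    # Compare the sizes of the standard and modified CTC topologies for a
    # large vocabulary; the arc counts differ by roughly a factor of V.
    standard = k2.ctc_topo(max_token=5000, modified=False)
    modified = k2.ctc_topo(max_token=5000, modified=True)
    print("standard arcs:", standard.num_arcs)
    print("modified arcs:", modified.num_arcs)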

csukuangfj avatar May 03 '24 13:05 csukuangfj