
C++ hlg_decode inference is very slow for the first three runs

Open mn7026 opened this issue 3 years ago • 23 comments

When I load the HLG model and run batch inference, the first three predictions are very slow, but from the fourth one on the speed is normal. Why is this?

mn7026 avatar Sep 16 '22 07:09 mn7026

C++ hlg_decode inference

Which code are you using?

csukuangfj avatar Sep 16 '22 07:09 csukuangfj

I'm using the k2 v2.0-pre branch; the code is: https://github.com/k2-fsa/k2/blob/v2.0-pre/k2/torch/bin/hlg_decode.cu

mn7026 avatar Sep 16 '22 08:09 mn7026

Could you add the following three lines

  torch::jit::getExecutorMode() = false;
  torch::jit::getProfilingMode() = false;
  torch::jit::setGraphExecutorOptimize(false);

to https://github.com/k2-fsa/k2/blob/v2.0-pre/k2/torch/bin/hlg_decode.cu#L104 and try it again?

csukuangfj avatar Sep 16 '22 08:09 csukuangfj
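
(Editorial note: the three lines above disable PyTorch's JIT profiling executor and graph optimizer, which spend the first few invocations profiling and re-optimizing the TorchScript graph; that is why only the first runs are slow. An alternative that keeps the optimizer enabled is to run a few dummy forward passes at startup, so no real request ever hits the slow profiling runs. A minimal sketch, assuming the model is loaded as a torch::jit::Module as in hlg_decode.cu; the input shape and forward signature are placeholders to adjust for the actual exported model:)

  #include "torch/script.h"

  // Hypothetical warm-up helper: run the TorchScript model a few times on a
  // dummy input so the JIT finishes profiling and optimizing before real
  // traffic arrives. Adjust the shape/arguments to match the exported model.
  void WarmUp(torch::jit::Module &module, torch::Device device) {
    torch::NoGradGuard no_grad;  // inference only
    torch::Tensor dummy = torch::zeros({1, 100, 80}).to(device);  // (N, T, C)
    for (int32_t i = 0; i != 3; ++i) {
      module.run_method("forward", dummy);  // output discarded
    }
  }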

Thank you very much, the problem is solved perfectly

mn7026 avatar Sep 16 '22 09:09 mn7026

Hi, I turned this program into a web service, but the QPS is very low and each request takes much longer than local prediction. Why is this?

mn7026 avatar Sep 20 '22 05:09 mn7026

torch::set_num_threads(1); — will this line have any effect?

mn7026 avatar Sep 20 '22 05:09 mn7026

What changes have you made?

csukuangfj avatar Sep 20 '22 05:09 csukuangfj

I only added these three lines:

  torch::jit::getExecutorMode() = false;
  torch::jit::getProfilingMode() = false;
  torch::jit::setGraphExecutorOptimize(false);

mn7026 avatar Sep 20 '22 05:09 mn7026

I turned this program into a web service, but the QPS is very low

Do you recreate the model for each new request?

csukuangfj avatar Sep 20 '22 05:09 csukuangfj

No, I create it only once.

mn7026 avatar Sep 20 '22 05:09 mn7026

How many models are you using to serve client requests? Do you use multiple threads? Do you use batch processing?

csukuangfj avatar Sep 20 '22 05:09 csukuangfj

No, I use only one model, I set torch::set_num_threads(1), and I don't use batching.

mn7026 avatar Sep 20 '22 05:09 mn7026

torch::set_num_threads(1)

It sets the number of intra-op threads PyTorch uses inside a single forward pass; it does not add any concurrency across requests.

I suggest that you create a thread pool, where each thread handles one request.

csukuangfj avatar Sep 20 '22 05:09 csukuangfj
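
(Editorial sketch of the suggested setup: a fixed pool of worker threads, each pulling one complete request off a shared queue and handling it end-to-end. The pool itself is standard C++; how a "request" maps onto a task is an assumption about the service, not k2 or icefall API.)

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
 public:
  explicit ThreadPool(size_t num_threads) {
    for (size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] {
        while (true) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // handle one request end-to-end on this worker thread
        }
      });
    }
  }

  void Enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto &w : workers_) w.join();
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_ = false;
};

A handler would then do something like pool.Enqueue([request] { HandleOneRequest(request); }); where HandleOneRequest (a hypothetical name) is whatever the service already does for a single utterance.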

I have a question: there is no thread pool in wenet, yet its online deployment works normally. But I will try your method.

mn7026 avatar Sep 20 '22 09:09 mn7026

Maybe we need to know more about your setup. E.g., are you using the GPU the same way in the local deployment vs. the web service? Is it possible that one is using it but the other is not? Are you actually handling multiple parallel streams, or just one? If just one, a thread pool may not matter.

danpovey avatar Sep 20 '22 13:09 danpovey

The GPU is used the same way. But I'm handling multiple parallel streams in the web service, while the local deployment handles requests one by one.

mn7026 avatar Sep 20 '22 15:09 mn7026

I'm handling multiple parallel streams in the web service. It was normal at the beginning, but after a few minutes requests suddenly took a long time and then it crashed.

mn7026 avatar Sep 20 '22 15:09 mn7026

So how were you testing this? E.g., how were you creating sessions from the client, and on what kind of schedule?

danpovey avatar Sep 21 '22 03:09 danpovey

I'm testing with SOA asynchronous calls. One concurrent request takes only 80 ms, but with six concurrent requests it jumps to more than 600 ms, and then it crashes.

mn7026 avatar Sep 21 '22 04:09 mn7026

k2::FsaClass lattice = k2::GetLattice(nnet_output_chu, decoding_graph_, supervision_segments, search_beam, output_beam, min_activate_states, max_activate_states, subsampling_factor);

This line takes 40 ms with one concurrent request, 110 ms with two, and 260 ms with three; with six concurrent requests the lattice step takes more than 760 ms and the program crashes.

mn7026 avatar Sep 21 '22 05:09 mn7026

The first-generation Kaldi does not crash even when decoding takes 1600 ms, so I wonder whether the new generation does not support high concurrency.

mn7026 avatar Sep 21 '22 05:09 mn7026

Can you provide more details about the crash?

danpovey avatar Sep 21 '22 08:09 danpovey


#include <iostream>
#include <mutex>
#include <thread>

std::mutex mtx;

// nnet_output_chu, decoding_graph, and supervision_segments come from the
// surrounding decoding code.
k2::FsaClass lattice_test(torch::Tensor nnet_output_chu,
                          k2::FsaClass decoding_graph,
                          torch::Tensor supervision_segments) {
  // Serialize calls to k2::GetLattice; std::lock_guard unlocks
  // automatically, even if GetLattice throws.
  std::lock_guard<std::mutex> lock(mtx);
  std::cout << "lattice decode 1:" << "\n";
  k2::FsaClass lattice = k2::GetLattice(nnet_output_chu, decoding_graph,
                                        supervision_segments, 15, 4, 30,
                                        7000, 4);
  std::cout << "lattice decode 2:" << "\n";
  return lattice;
}

int main() {
  std::thread t1(lattice_test, nnet_output_chu, decoding_graph, supervision_segments);
  std::thread t2(lattice_test, nnet_output_chu, decoding_graph, supervision_segments);
  std::thread t3(lattice_test, nnet_output_chu, decoding_graph, supervision_segments);
  std::thread t4(lattice_test, nnet_output_chu, decoding_graph, supervision_segments);
  // Join all four threads; the original detached t2-t4, so main could
  // exit while they were still running.
  t1.join();
  t2.join();
  t3.join();
  t4.join();
}

Hi, when I simulate multi-threading locally (see the code above), it gets stuck if I don't lock, and it runs normally after locking. Why is this?

I tested k2::FsaClass lattice = k2::GetLattice(nnet_output_chu, decoding_graph, supervision_segments, 15, 4, 30, 7000, 4); from multiple threads, and I find this call is not thread-safe.

So with multiple threads (e.g., web-service requests arriving concurrently), it crashes.

After I add a mutex it works normally and supports 20 QPS; each 5-second wav takes 400 ms.

But adding a mutex is not a reasonable solution.

I don't know how to modify the internal code to make it thread-safe.

Can you help me? Thank you very much.

mn7026 avatar Sep 21 '22 11:09 mn7026
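
(Editorial note: whether k2::GetLattice can be made thread-safe internally is a question for the k2 developers. As an interim structure, instead of a bare mutex scattered through request code, every GetLattice call can be funneled through one dedicated thread: calls are serialized exactly as with the mutex, but the single-consumer structure is explicit and it gives a natural place to add the batch processing suggested earlier in the thread. The LatticeWorker class below is a hypothetical sketch, not k2 or icefall API.)

#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// All k2::GetLattice calls run on this one thread, so no two calls ever
// overlap; request threads block on a future for their result.
class LatticeWorker {
 public:
  LatticeWorker() : worker_([this] { Loop(); }) {}

  ~LatticeWorker() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Enqueue one decoding job; the returned future becomes ready once the
  // dedicated thread has run it.
  std::future<k2::FsaClass> Submit(std::function<k2::FsaClass()> job) {
    auto task =
        std::make_shared<std::packaged_task<k2::FsaClass()>>(std::move(job));
    std::future<k2::FsaClass> result = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      jobs_.push([task] { (*task)(); });
    }
    cv_.notify_one();
    return result;
  }

 private:
  void Loop() {
    while (true) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
        if (stop_ && jobs_.empty()) return;
        job = std::move(jobs_.front());
        jobs_.pop();
      }
      job();  // the only place k2::GetLattice ever runs
    }
  }

  std::queue<std::function<void()>> jobs_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_ = false;
  // Declared last so the queue, mutex, and cv are initialized before the
  // worker thread starts.
  std::thread worker_;
};

A request thread would then wait on its own lattice, e.g. auto lattice = worker.Submit([&] { return k2::GetLattice(nnet_output_chu, decoding_graph, supervision_segments, 15, 4, 30, 7000, 4); }).get();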