icefall
C++ hlg_decode inference is very slow for the first three runs
When loading the HLG model and running batch inference, the first three runs are very slow, but from the fourth run onward the speed is normal. Why is this?
> C++ hlg_decode inference
Which code are you using?
I'm using the k2 branch v2.0-pre; the code is: https://github.com/k2-fsa/k2/blob/v2.0-pre/k2/torch/bin/hlg_decode.cu
Could you add the following three lines
torch::jit::getExecutorMode() = false;
torch::jit::getProfilingMode() = false;
torch::jit::setGraphExecutorOptimize(false);
to https://github.com/k2-fsa/k2/blob/v2.0-pre/k2/torch/bin/hlg_decode.cu#L104 and try it again?
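For context, a minimal sketch of where these lines would go, assuming the usual torch::jit::load() flow in hlg_decode.cu (the include and model path are placeholders):

#include "torch/script.h"

int main() {
  // Disable the JIT profiling executor and graph optimization so the
  // first few forward passes are not spent profiling and recompiling
  // the TorchScript graph, which is what makes the first runs slow.
  torch::jit::getExecutorMode() = false;
  torch::jit::getProfilingMode() = false;
  torch::jit::setGraphExecutorOptimize(false);

  // Load the model only after the flags above are set.
  torch::jit::Module module = torch::jit::load("/path/to/cpu_jit.pt");
  module.eval();
  // ... run HLG decoding as before ...
  return 0;
}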
Thank you very much, the problem is solved perfectly.
Hi, I turned this program into a web service, but the QPS is very low and each request takes much longer than local prediction. Why is this?
Does torch::set_num_threads(1); have any effect here?
What changes have you made?
I only added these three lines: torch::jit::getExecutorMode() = false; torch::jit::getProfilingMode() = false; torch::jit::setGraphExecutorOptimize(false);
> I made this program into a web service, but the QPS is very low
Do you recreate the model for each new request?
No, it is created only once.
How many models are you using to serve client requests? Do you use multiple threads? Do you use batch processing?
No, I use only one model. The thread setting is torch::set_num_threads(1), and I don't use batching.
> torch::set_num_threads(1)
It sets the number of threads PyTorch uses for intra-op parallelism when running the model.
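In other words (a minimal sketch, assuming the standard PyTorch C++ API):

#include "torch/torch.h"

int main() {
  // Limit PyTorch's intra-op parallelism: each forward pass runs on a
  // single thread instead of competing for all cores.
  torch::set_num_threads(1);
  // Note: this does not create threads for handling requests. Serving
  // concurrent requests needs its own threading, e.g. the threadpool
  // suggested below.
  return 0;
}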
I suggest that you create a threadpool, where each thread handles one request.
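A minimal sketch of such a threadpool (the ThreadPool and Enqueue names are illustrative, not part of k2):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
 public:
  explicit ThreadPool(size_t num_threads) {
    for (size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] {
        while (true) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // each worker handles one request at a time
        }
      });
    }
  }

  void Enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto &w : workers_) w.join();
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_ = false;
};

Each enqueued task would run the decoding for one request; note that if all workers share a single model, the underlying decoding calls still need to be thread-safe.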
One question: there is no thread pool in wenet, yet its online deployment works normally. But I will try your method.
Maybe we need to know more about your setup. E.g., are you using the GPU the same way in local deployment vs. the web service? Is it possible that one is using it but the other is not? Are you actually handling multiple parallel streams, or just one? If just one, a threadpool may not matter.
The GPU is used the same way. But the web service handles multiple parallel streams, while the local deployment handles requests one by one.
When handling multiple parallel streams in the web service, it was normal at the beginning, but after a few minutes requests suddenly took a long time, and then it crashed.
So how were you testing this? E.g., how were you creating sessions from the client, and on what kind of schedule?
I'm testing with SOA asynchronous calls. A request takes only 80 ms with one concurrent request, but jumps to more than 600 ms with six concurrent requests, and then the service crashes.
This call: k2::FsaClass lattice = k2::GetLattice(nnet_output_chu, decoding_graph_, supervision_segments, search_beam, output_beam, min_activate_states, max_activate_states, subsampling_factor); takes 40 ms with one concurrent request, 110 ms with two, and 260 ms with three; with six concurrent requests it crashes, and the lattice step takes more than 760 ms.
The first generation of Kaldi does not crash even when a request takes 1600 ms, so I wonder whether the new generation does not support high concurrency.
Can you provide more details about the crash?
#include <iostream>
#include <mutex>
#include <thread>
// (plus the k2/torch headers that declare k2::FsaClass and k2::GetLattice)

std::mutex mtx;

// Forward declaration so main() can refer to lattice_test.
k2::FsaClass lattice_test(torch::Tensor nnet_output_chu,
                          k2::FsaClass decoding_graph,
                          torch::Tensor supervision_segments);

int main() {
  // nnet_output_chu, decoding_graph, and supervision_segments come from
  // the surrounding setup code (omitted here).
  std::thread t1(lattice_test, nnet_output_chu, decoding_graph, supervision_segments);
  std::thread t2(lattice_test, nnet_output_chu, decoding_graph, supervision_segments);
  std::thread t3(lattice_test, nnet_output_chu, decoding_graph, supervision_segments);
  std::thread t4(lattice_test, nnet_output_chu, decoding_graph, supervision_segments);
  // Join all four threads; detaching them would let main() exit before
  // the decodes finish.
  t1.join();
  t2.join();
  t3.join();
  t4.join();
}

k2::FsaClass lattice_test(torch::Tensor nnet_output_chu,
                          k2::FsaClass decoding_graph,
                          torch::Tensor supervision_segments) {
  // Serialize the GetLattice calls; without this lock the concurrent
  // calls hang or crash. lock_guard releases the mutex even on exception.
  std::lock_guard<std::mutex> lock(mtx);
  std::cout << "lattice decode 1:" << "\n";
  k2::FsaClass lattice = k2::GetLattice(nnet_output_chu, decoding_graph,
                                        supervision_segments, 15, 4, 30, 7000, 4);
  std::cout << "lattice decode 2:" << "\n";
  return lattice;
}
Hi, when I simulate multi-threading locally, the program gets stuck if I don't lock, and runs normally after locking. Why is this?
I tested k2::FsaClass lattice = k2::GetLattice(nnet_output_chu, decoding_graph, supervision_segments, 15, 4, 30, 7000, 4); in a multi-threaded style and found that this code is not thread-safe.
So if it is used with multiple threads (e.g. a web service handling concurrent requests), it crashes.
After I added a mutex it runs normally and supports 20 QPS: each 5-second wav takes 400 ms.
But adding a mutex is not a reasonable solution.
I don't know how to modify the internal code to make it thread-safe.
Can you help me? Thank you very much.