hexisyztem
> Because CMAKE_CUDA_ARCHITECTURES 60/61 don't support atomicAdd(__half*, float)
> > What is your GPU device model, and what is your CUDA version? If you compile on your own device, you can try modifying this line (https://github.com/bytedance/lightseq/blob/master/CMakeLists.txt#L88) to...
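The suggested modification might look like the following; a sketch only, assuming the `CMAKE_CUDA_ARCHITECTURES` variable named in the quoted comment, with illustrative architecture values (half-precision `atomicAdd(__half*, ...)` requires compute capability 7.0 or newer, which is why 60/61 fail):

```cmake
# Sketch: in lightseq's CMakeLists.txt near the linked line, drop
# architectures 60/61 and keep only ones that support half atomics.
# Pick the values matching your actual GPU (e.g. 70 = V100, 75 = T4, 80 = A100).
set(CMAKE_CUDA_ARCHITECTURES 70 75 80)
```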
If you want to run the test code, you can run it directly with `python3 test/xxx.py`.
If you mean multiple models performing inference at the same time, you can implement multi-card deployment through Triton server; lightseq provides a solution for integrating with triton-server.
What is your configuration file? I would guess you have assigned all the models to GPU-0, but I would need to see your configuration to analyze it.
By the way, `instance_group` - `count` needs to be set to 1. https://github.com/bytedance/lightseq/blob/master/examples/triton_backend/model_repo/transformer_example/config.pbtxt#L25
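The relevant section of each model's `config.pbtxt` might look like the sketch below; this is not copied from the linked file, just an illustration assuming standard Triton `instance_group` syntax, with placeholder GPU ids:

```
instance_group [
  {
    count: 1        # keep count at 1, as noted above
    kind: KIND_GPU
    gpus: [ 0 ]     # pin this model to one GPU; give each model a
                    # different id here for multi-card deployment
  }
]
```

If every model omits `gpus` (or lists GPU 0), Triton will place them all on GPU-0, which matches the symptom described.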
We do not currently support it. After we complete the development of the new architecture, we will find ways to support more models at a low cost.
Sorry, this is by design in the new architecture, which uses some fixed syntax to manage GPU memory sharing.
The set_ancestor function assigns cache_k to a contiguous segment within total_cache_k. In the specific case you gave, cache_k_out can be removed.
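The idea can be illustrated with a hypothetical sketch using NumPy views; the real set_ancestor in lightseq operates on GPU memory and its signature may differ, so treat the function below as an analogy only:

```python
import numpy as np

# total_cache_k plays the role of the shared parent buffer; each
# cache_k is a contiguous slice (view) into it, which is what
# set_ancestor establishes in the new architecture's memory sharing.
total_cache_k = np.zeros(1024, dtype=np.float16)

def set_ancestor(parent, offset, size):
    # Hypothetical stand-in: return a view into `parent` at `offset`.
    # Writes through the view land in the parent buffer directly,
    # which is why a separate cache_k_out buffer is unnecessary.
    return parent[offset:offset + size]

cache_k = set_ancestor(total_cache_k, offset=0, size=256)
cache_k[:] = 1.0  # writing to the view updates total_cache_k in place
```

Because cache_k is a view rather than a copy, kernels that write the key cache populate total_cache_k directly, with no copy-out step.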