Ziren Wang
We have upgraded our codebase from C++ to Python; the configuration is now clearer, so you can explicitly define the KV cache capacity in your own code....
Here, batchsize means the number of tokens per round, not the number of requests.
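To make the distinction concrete, here is a minimal, hypothetical sketch (the function and names below are illustrative, not Nanoflow code): a token-based batch size caps the total tokens scheduled per round, so a round may hold few long requests or many short ones.

```python
# Illustrative sketch: "batchsize" is a per-round token budget, not a
# request count. Names here are hypothetical, not from the Nanoflow codebase.

def fill_batch(requests, batch_size_tokens):
    """Greedily pack (request_id, num_tokens) pairs until the token budget is hit."""
    batch, tokens = [], 0
    for req_id, num_tokens in requests:
        if tokens + num_tokens > batch_size_tokens:
            break
        batch.append(req_id)
        tokens += num_tokens
    return batch, tokens

# Three requests with different lengths; a batchsize of 512 tokens admits
# only two of them, even though 512 is far larger than 3 "requests".
reqs = [("a", 300), ("b", 200), ("c", 100)]
print(fill_batch(reqs, 512))  # → (['a', 'b'], 500)
```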
To run on multiple GPUs, we have built an engine here: [Multi-GPU engine script](https://github.com/efeslab/Nanoflow/blob/Nanoflow-python/entry/test_multi_gpu3.py), which should help you build your own multi-GPU model.
We have changed our codebase from C++ to Python; it should now be easier to launch the program.
Here it is: auto_search/new_search.py on the Nanoflow-python branch. In this new version of auto-search, there are two stages. The first is that...
We have changed our codebase from C++ to Python, which should make debugging clearer. Moreover, we have removed the pllm_python module from the Python codebase, since it's useful in...
Is this fixed now? You could try the latest version of our codebase; it should work.
M is just a parameter for ensuring the correctness of sequential execution within one stream; this is known as a "big-M formulation". M only needs to be large enough in this case, it...
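The big-M idea can be sketched generically (this is an illustration of the technique, not code from Nanoflow's auto-search): a binary variable switches an ordering constraint on or off, and M must exceed any feasible time difference so that the "off" case is never binding.

```python
# Generic big-M sketch (hypothetical names, not Nanoflow code):
# enforce "kernel j starts after kernel i ends" only when both run in the
# same stream, via the linear constraint
#     start_j >= end_i - M * (1 - same_stream)
# If same_stream == 1, this reduces to start_j >= end_i (strict ordering).
# If same_stream == 0, it reduces to start_j >= end_i - M, which is vacuous
# provided M exceeds any possible end time -- hence "large enough".

M = 10_000  # must be at least the largest possible end_i in the schedule

def ordering_ok(start_j, end_i, same_stream):
    """Check whether the big-M ordering constraint is satisfied."""
    return start_j >= end_i - M * (1 - same_stream)

# Same stream: j starting at t=5 before i ends at t=8 violates the constraint.
print(ordering_ok(5, 8, 1))   # → False
# Different streams: the big-M term deactivates the constraint entirely.
print(ordering_ok(5, 8, 0))   # → True
```

If M were too small (say, smaller than a possible end time), the "deactivated" constraint could still wrongly restrict schedules across streams, which is why correctness only requires M to be sufficiently large.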
Could you try again with the current codebase? We have upgraded from C++ to Python, so it should be easier to implement a new model in...
FileNotFoundError when running run_llama3.py: missing ../auto_search/8B_search_result_large_btz.json
You could try test_correctness first. test_performance() is designed to evaluate the performance of our auto-search result. For auto search, you need to run profiling first, and then run...