Hongyi Jin
What do you mean? Like a "stop generation" button to enable early stopping?
We will post a tutorial on how to build the model into generated source code, but that won't be ready until we migrate vicuna-v0 to vicuna-v1.1 (The reason is...
Thanks for your patience. We've just added instructions for building models and deploying them locally. Feel free to check it out. The wasm file will be under dist/vicuna-7b-v1 after running...
As additional info: to get the source code, run
```
python3 build.py --target webgpu --debug-dump
```
and you will see the IR and WGSL files under dist/vicuna-7b-v1/debug.
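For example, once the build finishes you can inspect the dumped artifacts directly. A minimal sketch (the exact file names depend on the model you built, so the glob below is illustrative):
```
# List the dumped IR and WGSL files
ls dist/vicuna-7b-v1/debug

# View one of the generated WebGPU shaders
less dist/vicuna-7b-v1/debug/*.wgsl
```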
It is possible to combine the shards, but that doesn't make a huge difference in performance.
You may reduce the number of file handles, but you still need to load the same amount of weights from disk into memory.
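To illustrate the point, here is a quick sanity check, assuming the sharded parameters live under dist/vicuna-7b-v1/params (the path and file pattern are illustrative): the total bytes on disk, and therefore the total bytes that have to be loaded, stay the same whether the weights are sharded or merged into one file.
```
# Sum the sizes of all weight shards; merging them would not change this total
du -ch dist/vicuna-7b-v1/params/*.bin | tail -n 1
```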
Really interesting phenomenon. Could you share which GPU you are using and your memory usage when running llama.cpp? I can infer which compression strategy you are using on llama.cpp with...
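In case it helps, something like the following collects that information on an NVIDIA Linux machine (nvidia-smi is assumed to be installed; the llama.cpp binary name and model path are illustrative):
```
# GPU model and memory usage
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv

# Peak resident memory of the llama.cpp run (requires GNU time)
/usr/bin/time -v ./main -m models/7B/ggml-model-q4_0.bin -p "Hello" 2>&1 | grep "Maximum resident"
```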
Got it. We are currently focused on getting maximum performance out of the MacBook GPU backend, and only guarantee runnability on other backends (including the NVIDIA GPU you are running with). Performance on other backends...
For "bringing maximum performance on macbook gpu", you need to know that the same code implementation could have dramatically different performance on different hardwares. So it's impossible to run equally...
Thank you for your suggestion. We will put together a tutorial in the coming weeks.