Hongyi Jin
What do you mean? Like a "stop generation" button to enable early stopping?
We will post a tutorial on how to build the model into generated source code, but that won't be ready until we migrate vicuna-v0 to vicuna-v1.1 (The reason is...
Thanks for your patience. We've just added instructions for building models and deploying them locally. Feel free to check it out. The wasm file will be under dist/vicuna-7b-v1 after running...
As additional info: to get the source code, run
```
python3 build.py --target webgpu --debug-dump
```
and you will see the IR and WGSL files under dist/vicuna-7b-v1/debug.
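For example, once the build finishes you can inspect the dumped artifacts directly. A minimal sketch (the exact file names depend on the model you built, so the glob below is illustrative):
```
# List the dumped IR and WGSL files
ls dist/vicuna-7b-v1/debug

# View one of the generated WebGPU shaders
less dist/vicuna-7b-v1/debug/*.wgsl
```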
It is possible to combine the shards, but that doesn't make a huge difference in performance.
You may reduce the number of file handles, but you still need to load the same amount of weights from disk into memory.
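To illustrate the point, here is a quick sanity check, assuming the sharded parameters live under dist/vicuna-7b-v1/params (the path and file pattern are illustrative): the total bytes on disk, and therefore the total bytes that have to be loaded, stay the same whether the weights are sharded or merged into one file.
```
# Sum the sizes of all weight shards; merging them would not change this total
du -ch dist/vicuna-7b-v1/params/*.bin | tail -n 1
```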
Really interesting phenomenon. Could you share which GPU you are using and your memory usage when running llama.cpp? I can infer which compression strategy you are using on llama.cpp with...
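In case it helps, something like the following collects that information on an NVIDIA Linux machine (nvidia-smi is assumed to be installed; the llama.cpp binary name and model path are illustrative):
```
# GPU model and memory usage
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv

# Peak resident memory of the llama.cpp run (requires GNU time)
/usr/bin/time -v ./main -m models/7B/ggml-model-q4_0.bin -p "Hello" 2>&1 | grep "Maximum resident"
```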
Got it. We are currently focused on getting maximum performance out of the MacBook GPU backend, and only guarantee runnability on other backends (including the NVIDIA GPU you are running with). Performance on other backends...
For "bringing maximum performance on macbook gpu", you need to know that the same code implementation could have dramatically different performance on different hardwares. So it's impossible to run equally...
Thank you for your suggestion. We will put together a tutorial in the coming weeks.