AXKuhta
Exporting to ONNX is something that I've been tinkering with, and I can report that the 169m RWKV-4 model does run in the browser. Here's my code: https://github.com/AXKuhta/RWKV-LM/tree/onnx There are two things...
> For example, if you prompt the 430m model with "The capital of France is" it continues with "first of the, the city of Paris" That seems familiar! ``` The...
@josephrocca I had to host the demo locally because huggingface keeps terminating the model downloads for some reason, but otherwise I can confirm that it works on my machine. Good...
It looks like the webgl backend has a lot of limitations. I did some testing, stripping out different parts of the model to see if I can...
I have been able to force the full model to run on webgl, but it doesn't produce anything coherent, so something's still broken: https://github.com/AXKuhta/RWKV-LM/tree/onnx_webgl @BlinkDL The "cannot resolve operator 'Max'...
@BlinkDL After some painstaking debugging I got it to produce coherent output on webgl. The fix was really bizarre: add `+ 0.0` in a bunch of places. Some nodes on...
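To be clear about what the workaround looks like at the source level, here's a toy numpy illustration (the function and its argument names are made up, not the actual RWKV code): `+ 0.0` is mathematically an identity, but when traced it becomes an extra Add node in the exported graph, which apparently stops the webgl backend from fusing the surrounding ops into a kernel that miscomputes.

```python
import numpy as np

def mix(x, last_x, time_mix):
    # Original formulation of a token-shift-style interpolation.
    return x * time_mix + last_x * (1 - time_mix)

def mix_patched(x, last_x, time_mix):
    # Same math, but "+ 0.0" survives tracing as a separate Add node,
    # breaking up the fused op group that misbehaved on webgl.
    return x * time_mix + last_x * (1 - time_mix) + 0.0

x = np.random.rand(4).astype(np.float32)
last_x = np.random.rand(4).astype(np.float32)
t = np.float32(0.5)
# Numerically the two are identical; the fix only matters at graph level.
print(np.allclose(mix(x, last_x, t), mix_patched(x, last_x, t)))
```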
@josephrocca I think it's better to keep all the web models in one place, so I made two PRs to your Hugging Face repository. Oh, and by the way, I also...
@BlinkDL The final [768, 50277] matmul is the slowest component. It's almost as slow as the entire model on WASM, which is kind of surprising, considering that GPUs are supposed...
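For scale, a quick back-of-the-envelope on that head matmul (only the [768, 50277] shape comes from the measurement above; the rest is generic arithmetic):

```python
# Cost of the final projection: a [768] hidden state times a [768, 50277] weight matrix.
d_model = 768
vocab = 50277

# One multiply-accumulate per weight element.
macs = d_model * vocab               # ~38.6 million MACs per token
# A matrix-vector product reuses nothing: every fp32 weight is read once per token.
weight_bytes = d_model * vocab * 4   # ~154 MB of weight traffic per token

print(f"{macs / 1e6:.1f} M MACs, {weight_bytes / 1e6:.1f} MB of weights per token")
```

So the op does roughly one MAC per four bytes loaded, i.e. it's bandwidth-bound rather than compute-bound, which would be one plausible reason a GPU backend shows much less of an advantage here than on the reused square matmuls inside the layers.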
>And actually you can skip the final matmul when scanning the prompt @BlinkDL Ooh, somehow I didn't think of that before! There is a ["only_execute_path_to_fetches" switch](https://github.com/microsoft/onnxruntime/blob/80c8d934b84f0083558b61550f6180e6d8f42423/include/onnxruntime/core/framework/run_options.h#L28) in onnxruntime that can...
>Oh and please check the speed of onnxruntime in pytorch Here are some performance numbers for RWKV-4 with pytorch and native onnxruntime: ``` Native pytorch + onnxruntime 169m model Intel Core...