Victor Nogueira

99 comments by Victor Nogueira

Ah, no worries @ngxson! My intention was just to document it, so other devs facing this issue can get some clues. But I'm not waiting for it to be fixed, as...

After the launch of iOS 18, most of those out-of-memory issues seem to be gone! 🎉 I noticed that they (Apple) now force Safari to hard-reload the...

I've got a 7B Q2_K model working! (Total file size: 2.72 GB) I was able to use a context of up to `n_ctx: 9 * 1024` with `cache_type_k: "q4_0"`. The inference...
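For reference, this is roughly how those options can be passed when loading the model in an ES module. It's a minimal sketch assuming wllama's `loadModelFromUrl` config; the WASM paths and model URL below are placeholders and may differ in your setup/version:

```ts
import { Wllama } from "@wllama/wllama";

// NOTE: the keys/paths below are assumptions; adjust them to wherever your app
// serves the wllama WASM builds.
const CONFIG_PATHS = {
  "single-thread/wllama.wasm": "/wllama/single-thread/wllama.wasm",
  "multi-thread/wllama.wasm": "/wllama/multi-thread/wllama.wasm",
};

const wllama = new Wllama(CONFIG_PATHS);

// Load the model with a 9K context and a quantized K cache to keep memory usage low.
await wllama.loadModelFromUrl("https://example.com/mistral-7b.Q2_K.gguf", {
  n_ctx: 9 * 1024,      // context size that worked on my device
  cache_type_k: "q4_0", // quantize the KV cache keys (f16 was running out of memory)
});
```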

Now I've got a [7B Q3_K_M](https://huggingface.co/Felladrin/gguf-sharded-Mistral-7B-OpenOrca) working! (Total file size: 3.52 GB) I think the previous attempt didn't work because I was setting too small a split size. I've increased it to...
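For loading a sharded model, a small helper can build the shard URL list following llama.cpp's split naming scheme. The prefix and shard count below are hypothetical, and passing the list straight to the loader is an assumption about the API:

```ts
// Build shard URLs following llama.cpp's split naming scheme:
// <prefix>-00001-of-000NN.gguf ... <prefix>-000NN-of-000NN.gguf
function ggufShardUrls(baseUrl: string, prefix: string, totalShards: number): string[] {
  const pad = (n: number) => n.toString().padStart(5, "0");
  return Array.from(
    { length: totalShards },
    (_, i) => `${baseUrl}/${prefix}-${pad(i + 1)}-of-${pad(totalShards)}.gguf`
  );
}

// Hypothetical prefix and shard count, just to illustrate the pattern:
const shardUrls = ggufShardUrls(
  "https://huggingface.co/Felladrin/gguf-sharded-Mistral-7B-OpenOrca/resolve/main",
  "mistral-7b-openorca.Q3_K_M",
  12
);

// Assumption: the loader accepts the list of shard URLs (or the first shard's URL).
// await wllama.loadModelFromUrl(shardUrls, { n_ctx: 4096, cache_type_k: "q4_0" });
```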

@flatsiedatsie, please confirm whether you have set `cache_type_k: "q4_0"` when loading the model. It seems to be failing because `cache_type_k` is `f16`, as shown in the screenshot.

I'm happy to see it too! I usually leave `n_batch` unset. By default it will be set to the same value as `n_ctx`, and I haven't had problems with...

One important consideration is that certain browsers, such as Brave, may alter the value of `navigator.hardwareConcurrency` to prevent fingerprinting (reference: https://github.com/brave/brave-browser/issues/10808). As a result, it is possible that the...
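A defensive way to pick the thread count is to clamp whatever the browser reports and fall back to a conservative default when the value looks unreliable. This is just a sketch; the `n_threads` option name in the comment is an assumption:

```ts
// Some browsers (e.g. Brave) cap or randomize navigator.hardwareConcurrency to
// prevent fingerprinting, so clamp the reported value and fall back when it is missing.
function pickThreadCount(maxThreads = 8, fallback = 4): number {
  const reported =
    typeof navigator !== "undefined" ? navigator.hardwareConcurrency : undefined;
  if (!reported || reported < 1) return fallback;
  return Math.min(reported, maxThreads);
}

// Assumption: the runtime takes a thread-count option (name may differ), e.g.:
// await wllama.loadModelFromUrl(modelUrl, { n_threads: pickThreadCount() });
```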

> Multithreading is not turning on in Brave and Firefox. Also, is there any way to increase performance without any middleware when using the model file from local?
>
> ...
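One quick thing to check is whether the page is cross-origin isolated, since WASM multithreading depends on `SharedArrayBuffer`. A small sketch (what happens when the check fails depends on the library/build you use):

```ts
// WASM multithreading requires SharedArrayBuffer, which browsers only expose when the
// page is cross-origin isolated (served with COOP/COEP headers).
function canUseWasmThreads(): boolean {
  return (
    typeof SharedArrayBuffer !== "undefined" &&
    typeof crossOriginIsolated !== "undefined" &&
    crossOriginIsolated
  );
}

console.log("Multithreading available:", canUseWasmThreads());

// Headers the page needs to be served with:
//   Cross-Origin-Opener-Policy: same-origin
//   Cross-Origin-Embedder-Policy: require-corp
```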

For a small embedding model that fits this case well, I can recommend this one: [sentence-transformers/multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) ([GGUF](https://huggingface.co/Felladrin/gguf-multi-qa-MiniLM-L6-cos-v1))
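As a rough sketch of how such a model gets used for retrieval, cosine similarity over the embeddings is the usual comparison. The `embed` function below is a hypothetical stand-in for whatever embedding call your runtime exposes:

```ts
// Cosine similarity between two embedding vectors (the model above is tuned for cosine similarity).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// `embed` is a hypothetical stand-in for the embedding call of your runtime,
// after loading the MiniLM GGUF above.
declare function embed(text: string): Promise<number[]>;

const [query, doc] = await Promise.all([
  embed("How can I split a GGUF model?"),
  embed("GGUF files can be split into shards before uploading."),
]);
console.log("similarity:", cosineSimilarity(query, doc));
```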

I noticed a significant benefit in splitting the models, mostly due to the cache size constraints of Safari. Mobile Safari has a cache limit of 300MB, while Desktop Safari has...
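To sanity-check how much storage the browser is actually granting before deciding on a shard size, `navigator.storage.estimate()` gives a rough picture. It doesn't map exactly onto Safari's Cache API limits, but it helps:

```ts
// Rough check of how much storage the browser grants this origin; useful when deciding
// whether the model shards will fit under the browser's cache constraints.
async function logStorageEstimate(): Promise<void> {
  if (!("storage" in navigator) || !navigator.storage.estimate) {
    console.log("StorageManager.estimate() is not available in this browser.");
    return;
  }
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  const toMB = (n: number) => (n / (1024 * 1024)).toFixed(1);
  console.log(`Storage used: ${toMB(usage)} MB of ~${toMB(quota)} MB quota`);
}

await logStorageEstimate();
```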