Endre Stølsvik
Just quick pastes from Gitter, since family calls: Point is, if you have (as per your example) -Xmx2G maxbytes=8G and maxphysicalbytes=8G, then you ACTUALLY only have 6GB available for off-heap. Because...
So, basically, how this works out for dl4j: You need to set maxphysicalbytes to the highest number of bytes you want the process to take in total, including all three of the JVM...
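A minimal sketch of the arithmetic, assuming JavaCPP's reporting accessors on org.bytedeco.javacpp.Pointer (maxBytes(), maxPhysicalBytes()); the class name and printout are illustrative only:

```java
import org.bytedeco.javacpp.Pointer;

// Run with e.g. -Xmx2G -Dorg.bytedeco.javacpp.maxbytes=8G -Dorg.bytedeco.javacpp.maxphysicalbytes=8G
public class JavaCppMemoryCheck {
    public static void main(String[] args) {
        long heapMax = Runtime.getRuntime().maxMemory();   // roughly -Xmx
        long maxBytes = Pointer.maxBytes();                // off-heap allocation cap tracked by JavaCPP
        long maxPhysical = Pointer.maxPhysicalBytes();     // cap on the WHOLE process' physical memory

        // Since maxphysicalbytes covers heap + off-heap + JVM overhead, the effective
        // off-heap headroom is roughly maxPhysical minus the heap: ~6GB in the 2G/8G example.
        long effectiveOffHeap = maxPhysical - heapMax;

        System.out.printf("heapMax=%d, maxBytes=%d, maxPhysical=%d -> ~%d effective off-heap%n",
                heapMax, maxBytes, maxPhysical, effectiveOffHeap);
    }
}
```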
I believe this goes for every model, and it is clearly a bit problematic. That is, the prompt template (and "history template") of the "chat interfaces" with the different models...
Oh, I guess this is exactly what https://github.com/Mozilla-Ocho/llamafile/issues/65 is about. Pointing to this blogpost: https://huggingface.co/blog/chat-templates
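To make the problem concrete, here is a sketch of how the same two-message exchange renders under two well-known template families, ChatML-style and Llama-2-chat-style (illustrative only; the exact tokens and whitespace vary per model and version):

```java
public class ChatTemplates {
    public static void main(String[] args) {
        String system = "You are a helpful assistant.";
        String user = "Hello!";

        // ChatML-style template:
        String chatMl = "<|im_start|>system\n" + system + "<|im_end|>\n"
                + "<|im_start|>user\n" + user + "<|im_end|>\n"
                + "<|im_start|>assistant\n";

        // Llama-2-chat-style template:
        String llama2 = "<s>[INST] <<SYS>>\n" + system + "\n<</SYS>>\n\n" + user + " [/INST]";

        System.out.println(chatMl);
        System.out.println(llama2);
    }
}
```

Feed a model a prompt in the wrong template and it will usually still answer, just noticeably worse - which is exactly why shipping the template with the model (as the Hugging Face post argues) matters.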
Definitely would be a welcome addition, yes! :+1: Edit: There is already an issue in 'llama.cpp' about lots of features that should go into the server; I added a comment...
The OpenAI-compatible embeddings-endpoint is directly mentioned here, I realize: https://github.com/ggerganov/llama.cpp/issues/4216#issuecomment-1858542650
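For reference, a minimal sketch of what such an OpenAI-compatible embeddings call looks like from Java. The host, port, and model name are assumptions; the request/response shape is the standard OpenAI one ({"model": ..., "input": ...} in, {"data": [{"embedding": [...]}]} out):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EmbeddingsCall {
    public static void main(String[] args) throws Exception {
        // OpenAI-style embeddings request body.
        String body = "{\"model\": \"default\", \"input\": \"Hello, world!\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/embeddings")) // assumed local server
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Expected OpenAI shape: {"data": [{"embedding": [0.01, ...], ...}], ...}
        System.out.println(response.body());
    }
}
```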
(Funny, that - exactly one year later, I am again thinking about this, about to create an issue, but the "The following issues might be related" box caught me.) I think...
Judging by serialization/deserialization times, and compress/decompress times (once I made these timings available!), this won't shave more than a few milliseconds off the total processing time for a...
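A sketch of the kind of timing that underlies that judgment, using the JDK's Deflater directly (the payload here is hypothetical; real numbers obviously depend on message size and compression level):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressTiming {
    public static void main(String[] args) {
        // Hypothetical stand-in for a serialized DTO.
        byte[] payload = "example DTO serialized to JSON...".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(payload);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        long micros = (System.nanoTime() - start) / 1_000;

        System.out.printf("compressed %d -> %d bytes in %d us%n",
                payload.length, out.size(), micros);
    }
}
```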
Interesting blog post about UTF-8 encoding: http://psy-lob-saw.blogspot.no/2012/12/encode-utf-8-string-to-bytebuffer-faster.html
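The gist of that post is that encoding via a (reusable) CharsetEncoder straight into a target ByteBuffer can avoid the intermediate byte[] that String.getBytes() allocates; a minimal sketch of the two routes compared there:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8Encode {
    public static void main(String[] args) {
        String s = "Hello, Stølsvik!";

        // Simple route: allocates an intermediate byte[] before wrapping.
        ByteBuffer viaGetBytes = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));

        // CharsetEncoder route: encodes straight into a target buffer;
        // the encoder can be reset and reused across calls.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer direct = ByteBuffer.allocate(s.length() * 3); // UTF-8 worst case per char
        encoder.encode(CharBuffer.wrap(s), direct, true);
        direct.flip();

        System.out.println(viaGetBytes.remaining() + " / " + direct.remaining() + " bytes");
    }
}
```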
I had evidently forgotten this case when making #45, which is identical - and closed by https://github.com/centiservice/mats3/commit/84ba5747efe4c5dacc01098601f2641a2c196831 So, I'll reuse this for what I had forgotten there: Make it available...