Results: 24 comments by Aaron Miller

Ah nuts - alright, I'm fetching the old version to test. I do think it'd be good to get a code model sooner rather than later, but I also think...

This got a bit hairy after the multiple-implementation split - doing it without creating multiple embedded copies of the tokenizer configs was the tricky part - but it should be a bit more doable as of the `prompt()`...

> This can be closed since the tokenizer changes upstream?

No, there's still no upstream fix for this - it requires file format changes, so it's not likely happening upstream...

If a major file format change is going to happen again, the tokenizer configs for the models using Hugging Face `tokenizers` BPE/GPT-2-like tokenizers ought to be improved (i.e. all but the...
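
(For concreteness, a rough sketch of the round-trip that embedding one of those configs would need, using the Hugging Face `tokenizers` Python package - the file name and the embed-the-JSON-blob scheme are illustrative, not the actual file format:)

```python
# Rough sketch: round-tripping a Hugging Face `tokenizers` config so the
# JSON blob could travel inside the model file instead of as a separate
# tokenizer.json. The embedding scheme here is illustrative only.
from tokenizers import Tokenizer

# Load a GPT-2-style BPE tokenizer from its JSON config (assumed path).
tok = Tokenizer.from_file("tokenizer.json")

# Serialize the full config to a JSON string - this is the blob a file
# format change would need to carry alongside the weights.
blob = tok.to_str().encode("utf-8")

# At load time, rebuild the tokenizer from the embedded blob.
restored = Tokenizer.from_str(blob.decode("utf-8"))
assert restored.encode("hello world").ids == tok.encode("hello world").ids
```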

I'm also running into this with decently sized images - given a folder with a hefty enough number of 5K-8K JPEGs, I can reliably *crash* the renderer process if I...

My comment about `MADV_SEQUENTIAL` was assuming you were trying to implement a zero-copy approach and have the inference-time code use the model directly from the mapping, rather than still copying everything...
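
(Rough sketch of the zero-copy shape I mean, using Python's `mmap` - the path, the header offset, and the sequential-access assumption are all illustrative:)

```python
# Sketch of a zero-copy load: map the model file read-only and let
# inference-time code read tensors straight out of the mapping, with
# MADV_SEQUENTIAL hinting that a forward pass walks the weights roughly
# in file order. POSIX-only; path and header size are placeholders.
import mmap

with open("model.bin", "rb") as f:  # placeholder path
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

# Hint the kernel that access is mostly sequential, so it can read ahead
# aggressively and drop already-consumed pages sooner.
mm.madvise(mmap.MADV_SEQUENTIAL)

# Zero-copy means tensors are views into the mapping, never copies:
header_size = 4096  # hypothetical header size, purely for illustration
weights_view = memoryview(mm)[header_size:]
```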

> Cool - will take a look soon. Is this using actual MQA or is it still doing the trick with the copies?

It does copies with `ggml_repeat` presently - also wound...
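
(For anyone following along, a rough NumPy sketch of the difference - shapes and names are mine, not the actual ggml code; broadcasting stands in for true MQA here:)

```python
# Rough NumPy sketch of the copy-based trick vs. actual MQA. Shapes and
# names are illustrative, not the real ggml implementation.
import numpy as np

n_head, seq, head_dim = 8, 16, 64
q = np.random.randn(n_head, seq, head_dim)   # per-head queries
k = np.random.randn(1, seq, head_dim)        # single shared K head (MQA)

# The ggml_repeat-style trick: physically tile the shared head so the
# ordinary multi-head attention path can run unchanged.
k_rep = np.repeat(k, n_head, axis=0)             # (n_head, seq, head_dim)
scores_copy = q @ k_rep.transpose(0, 2, 1)       # (n_head, seq, seq)

# Actual MQA: every query head attends against the one shared head with
# no materialized copies (NumPy broadcasting stands in for that here).
scores_mqa = q @ k.transpose(0, 2, 1)

assert np.allclose(scores_copy, scores_mqa)
```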

> Are you working on a 40B branch already?

I'm not presently - being that it's *big*, it's a bit more inconvenient to hack on, as I'd need to...

For me it's exactly 1718 - but I just realized I can get the same behavior with a q4_0 model if I bump it to 2570 (maybe less, didn't narrow...