nonnull

20 comments by nonnull

A lofty goal! Be aware that under the hood llama (and indeed most ANNs) use floating-point, and floating-point determinism is a rabbit hole with no bottom. Some particular issues that...
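For a concrete taste of why, float addition is not even associative, so any change in reduction order (thread count, SIMD width, a different BLAS kernel) can change results bit-for-bit. A minimal standalone C++ illustration:

```cpp
#include <cstdio>

// Regrouping the same three values gives two different answers: when
// 1.0f is added to -1e8f first, it is absorbed by rounding.
int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    std::printf("(a + b) + c = %.9g\n", (a + b) + c); // prints 1
    std::printf("a + (b + c) = %.9g\n", a + (b + c)); // prints 0
    return 0;
}
```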

You seem to be thinking that a transformer is a function `F(tokens[0..=n-1]) -> probs[n]`. It isn't. It's a function `F(tokens[0..=n-1], probs[0..=n-1]) -> probs[n]`: You need the output probabilities of the...
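To make the distinction concrete, here is a minimal sketch (all names hypothetical; `step` is a stub standing in for the model): position n cannot be evaluated without the outputs of positions 0..n-1, so the loop is inherently sequential.

```cpp
#include <cstddef>
#include <vector>

using Probs = std::vector<float>;

// Stub standing in for the model's evaluation of position n; the point
// is the third argument: the outputs already produced for 0..n-1.
static Probs step(const std::vector<int> &tokens, std::size_t n,
                  const std::vector<Probs> &history) {
    (void)tokens; (void)n; (void)history;
    return Probs(32000, 1.0f / 32000.0f); // dummy uniform distribution
}

// probs[n] = F(tokens[0..=n-1], probs[0..=n-1]): each position feeds
// the next, so positions must be evaluated in order.
std::vector<Probs> run(const std::vector<int> &tokens) {
    std::vector<Probs> history;
    for (std::size_t n = 0; n < tokens.size(); ++n)
        history.push_back(step(tokens, n, history));
    return history;
}
```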

I've been hitting this with the 65B model in oneshot mode with a 2048-token context. So it's not just interactive sessions that are affected. This is fairly easily reproducible with...

> Increasing batch size also makes llama.cpp run out of memory, so any solution that only considers the context size and not the batch size is likely wrong.

Ah. That...
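As a hedged back-of-envelope (shapes assumed for illustration, not llama.cpp's actual bookkeeping): the attention-score scratch alone holds one row of length n_ctx per head per batch token, so memory grows with both parameters, and a limit keyed to n_ctx alone under-reserves as soon as the batch grows.

```cpp
#include <cstddef>

// Rough KQ scratch estimate (assumed shapes, for illustration only):
// n_head * n_batch rows, each n_ctx elements wide.
std::size_t attn_scratch_bytes(std::size_t n_ctx, std::size_t n_batch,
                               std::size_t n_head, std::size_t elem_size) {
    return n_head * n_batch * n_ctx * elem_size;
}
// e.g. 32 heads, batch 512, ctx 2048, f32: 32 * 512 * 2048 * 4 = 128 MiB
```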

I haven't looked; how does LLaMA handle prompts that are smaller than the context size? E.g. a 1024-token prompt with a 2048-token context size. Does it just truncate to an effective context...

I was debugging an issue with the same symptoms today. As it turns out, the problem was that I was returning a Gradio component itself from an event listener instead...

> Reading the comments above - yeah, if we can efficiently implement a lookup table `int8/int4->float16` using AVX/NEON, then it might be really worth trying the non-uniform approach.

For the...

Basic idea for an int4->f16 lookup table on AVX2:

Low nibbles:

1. Mask out high nibbles of each input byte, because `VPSHUFB` uses the high bit to clear bytes to...
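A sketch of that low-nibble path, assuming `tbl_lo` and `tbl_hi` hold the low and high bytes of the 16 fp16 table entries, each broadcast to both 128-bit lanes (`VPSHUFB` shuffles within each lane). This is just the idea, not llama.cpp's actual code:

```cpp
#include <immintrin.h>
#include <cstdint>

// Expand 32 int4 indices (one per byte of `idx`) into 32 fp16 values via
// two VPSHUFB byte-table lookups.
static inline void lut_int4_to_f16(__m256i idx, __m256i tbl_lo, __m256i tbl_hi,
                                   uint16_t *out /* 32 values */) {
    // Step 1: clear the high nibble of every byte. VPSHUFB zeroes the
    // output byte when the index's high bit is set, so keep indices 0..15.
    idx = _mm256_and_si256(idx, _mm256_set1_epi8(0x0F));
    // Step 2: one byte-table lookup per half of the fp16 value.
    __m256i lo = _mm256_shuffle_epi8(tbl_lo, idx);
    __m256i hi = _mm256_shuffle_epi8(tbl_hi, idx);
    // Step 3: interleave the byte halves into 16-bit values. Unpacking is
    // per-lane, so the output is lane-permuted relative to `idx`; a real
    // kernel would permute here or just store in a blocked order.
    _mm256_storeu_si256((__m256i *)(out +  0), _mm256_unpacklo_epi8(lo, hi));
    _mm256_storeu_si256((__m256i *)(out + 16), _mm256_unpackhi_epi8(lo, hi));
}
```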

> > The bins are now much more evenly utilized.
>
> I am wondering how we could update the algorithm so also the first bin is utilized, currently it...
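For context, the unused first bin falls out of the scheme itself; a hedged sketch of a symmetric round-to-nearest int4 quantizer (Q4_0-style, details assumed) shows bin 0 can never be hit:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

// With scale d = amax / 7 and offset +8, round(x / d) lands in [-7, 7],
// so quantized values occupy bins 1..15 and bin 0 (value -8) stays empty.
std::array<int, 16> bin_histogram(const std::vector<float> &w) {
    std::array<int, 16> hist{};
    float amax = 0.0f;
    for (float x : w) amax = std::max(amax, std::fabs(x));
    const float d = amax > 0.0f ? amax / 7.0f : 1.0f;
    for (float x : w) {
        int q = (int)std::lround(x / d) + 8; // -7..7 -> 1..15
        hist[std::clamp(q, 0, 15)]++;
    }
    return hist; // hist[0] is always 0
}
```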

> Perhaps Posit arithmetic could be valuable? http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf

Posits offer more dynamic range, at the expense of less accuracy for large numbers. If the largest weights matter the most, and...