Johannes Gäßler

235 comments by Johannes Gäßler

I tried a simple Python script to check whether the method works: ```python #!/usr/bin/env python3 import numpy as np import random SAMPLE_SIZE = 10000 VOCAB_SIZE = 10 BRANCHING_RATIO = 4...

Related discussion: https://github.com/flexflow/FlexFlow/issues/1302 I also noticed that my Python script had a bug in the normalization. This version should be fixed: ```python #!/usr/bin/env python3 import numpy as np import random...

>It's already slightly faster for TG after a long prompt, but the PP speed is ~15% lower compared to master. Still looking into it - any ideas are appreciated. >I'm...

I mean, I don't have the hardware to profile the Metal code but you should check how long the kernel takes compared to the equivalent GEMM kernels. In my experience...

I tried explicitly checking for work boundaries but the results were not good. But of course it's always possible that I just did it wrong. In any case, it would...

I fixed the off-by-one issue with n-gram size and uploaded a fixed `lookup.bin`. I also fixed a bug where wikitext-2 instead of wikitext-103 was used as the hard-coded input file....

I tested this PR on my RTX 3090 using [Yi 34b 200k RPMerge q4_K_M](https://huggingface.co/brucethemoose/Yi-34B-200K-RPMerge-iMat.GGUF). The tokenizer is different so a different `lookup.bin` file is needed (I uploaded it). Also because...

I've been thinking, maybe you could synthetically generate the data to use for the static lookup. Basically, just generate a bunch of outputs for different seeds and use them as...
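To illustrate the idea of deriving a static lookup from generated samples, here is a minimal sketch. It assumes the samples are already tokenized into lists of integer token IDs. The names `build_lookup` and `draft_token` are hypothetical, and the in-memory table is for illustration only; it is not the format of `lookup.bin`.

```python
from collections import Counter, defaultdict


def build_lookup(sequences, ngram_size=2):
    """Count which token follows each n-gram across all sample sequences."""
    table = defaultdict(Counter)
    for tokens in sequences:
        # Stop early enough that a continuation token exists after each n-gram.
        for i in range(len(tokens) - ngram_size):
            key = tuple(tokens[i:i + ngram_size])
            table[key][tokens[i + ngram_size]] += 1
    return table


def draft_token(table, context, ngram_size=2):
    """Return the most frequent continuation of the last n-gram, or None."""
    key = tuple(context[-ngram_size:])
    if key not in table:
        return None
    return table[key].most_common(1)[0][0]


# Hypothetical token sequences standing in for outputs generated with different seeds.
samples = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 4]]
table = build_lookup(samples)
```

With such a table, `draft_token(table, context)` proposes the continuation seen most often in the samples, and returns `None` when the n-gram never occurred.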

I ran some synthetic data generation as a test and at a large sample size the rate of data generation for Mixtral q8_0 with 3xP40 is ~7.9 MiB / 24...

Only for dense models. For MoE models the amount of data that you need to load increases by a factor of 4 for batched generation. Or if you can fit...
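As a rough sanity check on the factor of 4, the following sketch estimates how much more expert weight data a batch touches compared to a single token. It assumes a Mixtral-like layout of 8 experts with 2 active per token and, as a simplification, uniformly random and independent routing per token; real routing is not uniform. The function name `expert_load_factor` is hypothetical.

```python
def expert_load_factor(batch_size, n_experts=8, n_active=2):
    """Expected expert data loaded for a batch, relative to a single token.

    Assumes each token picks n_active distinct experts uniformly at random,
    independently of the other tokens in the batch (a simplification).
    """
    # Probability that a given expert is used by none of the tokens in the batch.
    p_unused = (1.0 - n_active / n_experts) ** batch_size
    expected_distinct = n_experts * (1.0 - p_unused)
    return expected_distinct / n_active


for batch_size in (1, 2, 4, 8, 64):
    print(batch_size, round(expert_load_factor(batch_size), 2))
```

Under these assumptions the factor starts at 1 for a single token and approaches 8 / 2 = 4 as the batch grows, since eventually every expert is needed.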