First of all, thank you for the awesome project! I have a question: when there are two queries (len=1) with different KV cache lengths (e.g., one is very long and one is...
I am working on the pytest for flash_attn_splitkv and want to know the meaning of paged_kv_block_size. Does it mean how many tokens' k/v cache fit in a paged block? I...
When I am building a compute graph and creating two tensors, e.g. A and B, tensor B's shape (ne) depends on a data member of A, which is unknown before compute_forward...
I am reproducing the perplexity results and am confused by the following comment lines in examples/perplexity/perplexity.cpp: // Download: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research // Run `./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw` // Output:...
Really appreciate your great work, as it lets me run MoE on a consumer GPU. I wonder how you transformed the original Mixtral 8x7B into the quantized one using...
Really appreciate your great work! I wonder if I can run your code on a 16 GB GPU?
### Describe the issue When performing inference with the model, can I input only text without providing an image? For example, in predict.py...
Really appreciate your work on MoE for multimodal LLMs. I'd like to know how to visualize Fig. 4 to show expert preferences across different modalities. Does it...