First of all, thank you for the awesome project! I have a question: when there are two queries (len=1) with different KV cache lengths (e.g., one is very long and one is...
I am working on the pytest for flash_attn_splitkv and want to know the meaning of paged_kv_block_size. Does it mean how many tokens' k/v cache fit in a paged block? I...
When I am building a compute graph and creating two tensors, e.g. A and B, tensor B's shape (ne) depends on a data member of A, which is unknown before compute_forward...
I am reproducing the perplexity results and am confused by the following comment lines in examples/perplexity/perplexity.cpp: // Download: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research // Run `./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw` // Output:...
Really appreciate your great work, as it lets me run MoE on a consumer GPU. I wonder how you transformed the original Mixtral 8x7B into the quantized one using...
Really appreciate your great work! I wonder if I can run your code on a 16 GB GPU?
### Describe the issue When performing inference with the model, can I input only text without providing an image? For example, in predict.py...
Really appreciate your work on MoE for multimodal LLMs. I'd like to know how to visualize Fig. 4 to show expert preferences across different modalities. Does it...