Diego Devesa
My goal for now is to implement this for CUDA in llama.cpp, and to show how this could be done to split computation between the CPU and CUDA in a...
I would suggest using an array for all `src` nodes instead. This will simplify the code when we need to look at the list of parents of a node....
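Roughly what I mean (a sketch only; `GGML_MAX_SRC` is just an illustrative bound here, not necessarily the value used in ggml):

```cpp
#define GGML_MAX_SRC 8 // assumed upper bound for illustration

struct ggml_tensor_sketch {
    // all parent tensors in one array, unused slots are nullptr
    struct ggml_tensor_sketch * src[GGML_MAX_SRC];
    // ... other fields elided
};

// walking the parents of a node no longer needs special cases per operand
static void visit_parents(ggml_tensor_sketch * node) {
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        if (node->src[i] == nullptr) {
            continue;
        }
        // process node->src[i] here (e.g. mark it during graph traversal)
    }
}
```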
@nullhook I am not sure I understand what you mean; the total number of source tensors available to ops will be the same, and it can be increased as needed....
There are already dequantization kernels, it would be better to reuse these instead of duplicating the code.
Since we are doing this from scratch, wouldn't it be better to remove the custom attention mask entirely and pass a list of KV cells used in each sequence? Considering...
We could use a vector with dimension `[num_seqs]` that contains the lengths of the sequences, and a 2D tensor with dimensions `[max_seq_len, num_seqs]` that contains the KV cells in each...
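Something along these lines (names are made up, just to illustrate the layout):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the proposed layout (illustrative only, not actual llama.cpp code):
// - seq_lens[s]  : number of KV cells used by sequence s
// - kv_cells     : [max_seq_len, num_seqs] tensor; the first dimension is the
//                  fastest-varying one, so the cells of a sequence are contiguous
struct kv_seq_layout {
    std::vector<int32_t> seq_lens;  // [num_seqs]
    std::vector<int32_t> kv_cells;  // max_seq_len * num_seqs entries
    int32_t max_seq_len = 0;
    int32_t num_seqs    = 0;

    // index of the j-th KV cell of sequence s
    int32_t cell(int32_t j, int32_t s) const {
        // entries with j >= seq_lens[s] are never read by the kernel
        return kv_cells[s*max_seq_len + j];
    }
};
```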
It seems that vLLM has added a new version of paged attention since I looked into the implementation (https://github.com/vllm-project/vllm/pull/1348). I am not sure what the changes are, but I think...
ALiBi could also be done in this kernel.
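For reference, a host-side sketch of the ALiBi bias that could be folded into the kernel, with the slopes as defined in the ALiBi paper (names here are illustrative):

```cpp
#include <cmath>
#include <vector>

// Apply the ALiBi bias to the pre-softmax scores of one query token.
// Slope for head h (0-based) out of n_heads: m = 2^(-8*(h+1)/n_heads).
void apply_alibi_bias(std::vector<float> & scores,        // [n_kv] scores for one query
                      int head, int n_heads,
                      const std::vector<int> & kv_pos,    // position of each KV cell
                      int q_pos) {
    const float slope = std::pow(2.0f, -8.0f*(head + 1)/n_heads);
    for (size_t j = 0; j < scores.size(); ++j) {
        scores[j] -= slope * (q_pos - kv_pos[j]); // penalize distant keys linearly
    }
}
```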
It should be possible to generate multiple sequences simultaneously with the batch API, which should be a lot faster.
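A rough sketch of how the sequences could be packed into a single `llama_batch` (field names as in `llama.h` at the time of writing; tokenization, sampling, and error handling omitted):

```cpp
#include "llama.h"

#include <vector>

// Decode the prompts of several sequences in a single forward pass.
static void decode_parallel(llama_context * ctx,
                            const std::vector<std::vector<llama_token>> & prompts) {
    int32_t n_tokens_total = 0;
    for (const auto & p : prompts) {
        n_tokens_total += (int32_t) p.size();
    }

    llama_batch batch = llama_batch_init(n_tokens_total, 0, /*n_seq_max=*/1);
    batch.n_tokens = 0;

    for (size_t s = 0; s < prompts.size(); ++s) {
        for (size_t i = 0; i < prompts[s].size(); ++i) {
            const int32_t idx = batch.n_tokens++;
            batch.token   [idx]    = prompts[s][i];
            batch.pos     [idx]    = (llama_pos) i;
            batch.n_seq_id[idx]    = 1;
            batch.seq_id  [idx][0] = (llama_seq_id) s;
            batch.logits  [idx]    = (i == prompts[s].size() - 1); // logits only for the last token
        }
    }

    llama_decode(ctx, batch); // all sequences are processed together

    llama_batch_free(batch);
}
```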
It's probably due to the duplicated `base_model.model` prefix in the tensor names. The conversion script only removes this prefix once.