Diego Devesa
My goal for now is to implement this for CUDA in llama.cpp, and to show how this could be done to split computation between the CPU and CUDA in a...
I would suggest using an array for all `src` nodes instead. This will simplify the code when we need to look at the list of parents of a node....
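Roughly what I mean (a sketch only; `GGML_MAX_SRC` is just an illustrative bound here, not necessarily the value used in ggml):

```cpp
#define GGML_MAX_SRC 8 // assumed upper bound for illustration

struct ggml_tensor_sketch {
    // all parent tensors in one array, unused slots are nullptr
    struct ggml_tensor_sketch * src[GGML_MAX_SRC];
    // ... other fields elided
};

// walking the parents of a node no longer needs special cases per operand
static void visit_parents(ggml_tensor_sketch * node) {
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        if (node->src[i] == nullptr) {
            continue;
        }
        // process node->src[i] here (e.g. mark it during graph traversal)
    }
}
```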
@nullhook I am not sure I understand what you mean; the total number of source tensors available to ops will be the same, and it can be increased as needed....
There are already dequantization kernels, it would be better to reuse these instead of duplicating the code.
Since we are doing this from scratch, wouldn't it be better to remove the custom attention mask entirely and pass a list of KV cells used in each sequence? Considering...
We could use a vector with dimension `[num_seqs]` that contains the lengths of the sequences, and a 2D tensor with dimensions `[max_seq_len, num_seqs]` that contains the KV cells in each...
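Something along these lines (names are made up, just to illustrate the layout):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the proposed layout (illustrative only, not actual llama.cpp code):
// - seq_lens[s]  : number of KV cells used by sequence s
// - kv_cells     : [max_seq_len, num_seqs] tensor; the first dimension is the
//                  fastest-varying one, so the cells of a sequence are contiguous
struct kv_seq_layout {
    std::vector<int32_t> seq_lens;  // [num_seqs]
    std::vector<int32_t> kv_cells;  // max_seq_len * num_seqs entries
    int32_t max_seq_len = 0;
    int32_t num_seqs    = 0;

    // index of the j-th KV cell of sequence s
    int32_t cell(int32_t j, int32_t s) const {
        // entries with j >= seq_lens[s] are never read by the kernel
        return kv_cells[s*max_seq_len + j];
    }
};
```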
It seems that vLLM has added a new version of paged attention since I looked into the implementation (https://github.com/vllm-project/vllm/pull/1348). I am not sure what the changes are, but I think...
ALiBi could also be done in this kernel.
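For reference, a host-side sketch of the ALiBi bias that could be folded into the kernel, with the slopes as defined in the ALiBi paper (names here are illustrative):

```cpp
#include <cmath>
#include <vector>

// Apply the ALiBi bias to the pre-softmax scores of one query token.
// Slope for head h (0-based) out of n_heads: m = 2^(-8*(h+1)/n_heads).
void apply_alibi_bias(std::vector<float> & scores,        // [n_kv] scores for one query
                      int head, int n_heads,
                      const std::vector<int> & kv_pos,    // position of each KV cell
                      int q_pos) {
    const float slope = std::pow(2.0f, -8.0f*(head + 1)/n_heads);
    for (size_t j = 0; j < scores.size(); ++j) {
        scores[j] -= slope * (q_pos - kv_pos[j]); // penalize distant keys linearly
    }
}
```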
It should be possible to generate multiple sequences simultaneously with the batch API, which should be a lot faster.
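A rough sketch of how the sequences could be packed into a single `llama_batch` (field names as in `llama.h` at the time of writing; tokenization, sampling, and error handling omitted):

```cpp
#include "llama.h"

#include <vector>

// Decode the prompts of several sequences in a single forward pass.
static void decode_parallel(llama_context * ctx,
                            const std::vector<std::vector<llama_token>> & prompts) {
    int32_t n_tokens_total = 0;
    for (const auto & p : prompts) {
        n_tokens_total += (int32_t) p.size();
    }

    llama_batch batch = llama_batch_init(n_tokens_total, 0, /*n_seq_max=*/1);
    batch.n_tokens = 0;

    for (size_t s = 0; s < prompts.size(); ++s) {
        for (size_t i = 0; i < prompts[s].size(); ++i) {
            const int32_t idx = batch.n_tokens++;
            batch.token   [idx]    = prompts[s][i];
            batch.pos     [idx]    = (llama_pos) i;
            batch.n_seq_id[idx]    = 1;
            batch.seq_id  [idx][0] = (llama_seq_id) s;
            batch.logits  [idx]    = (i == prompts[s].size() - 1); // logits only for the last token
        }
    }

    llama_decode(ctx, batch); // all sequences are processed together

    llama_batch_free(batch);
}
```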
It's probably due to the duplicated `base_model.model` prefix in the tensor names. The conversion script only removes this prefix once.