Diego Devesa
> ggml_graph_compute_plan() MUST be called because it also sets node->n_tasks. The work_size depends on n_tasks.

I think that `n_tasks` should be removed from `ggml_tensor`. For now, the easiest way to address...
> Of course, `n_tasks` should belong to the compute facility I think, it's ideal to migrate to some place else.

Yes, precisely! That's what I was thinking as well. I am...
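To make that concrete, here is a minimal sketch of what moving `n_tasks` out of `ggml_tensor` and into a plan/context object could look like. The field layout and exact signatures below are assumptions for illustration, not the final API:

```cpp
#include <stddef.h>
#include "ggml.h"

// Hypothetical sketch: per-node task counts and the work buffer size live in the
// plan/context object instead of in ggml_tensor.
struct ggml_cgraph_context {
    int    n_threads;                 // requested thread count
    size_t work_size;                 // required size of the work buffer
    void * work_data;                 // caller-provided work buffer of work_size bytes
    int    n_tasks[GGML_MAX_NODES];   // per-node task count, indexed like cgraph->nodes
};

// Pass 1: walk the graph, fill in n_tasks and work_size, compute nothing.
struct ggml_cgraph_context ggml_graph_compute_plan(struct ggml_cgraph * cgraph, int n_threads);

// Pass 2: execute the graph using the precomputed plan.
void ggml_graph_compute(struct ggml_cgraph_context * ctx, struct ggml_cgraph * cgraph);
```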
That's very similar to what I have been thinking. I am working on a CUDA implementation that can execute `ggml_cgraphs` directly, and what it needs to do that is very...
What I was thinking is that `n_threads` could be a parameter to `ggml_graph_compute_plan`, and it would also be stored in `ggml_cgraph_context` for use by `ggml_graph_compute`. For now, the CUDA runner...
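As a usage sketch under those assumptions (building on the types above; the allocation is done naively here just to show the flow):

```cpp
#include <stdlib.h>

// Sketch only: n_threads is given once at planning time and carried in the context,
// so ggml_graph_compute needs no separate thread argument.
static void run_graph(struct ggml_cgraph * gf, int n_threads) {
    struct ggml_cgraph_context ctx = ggml_graph_compute_plan(gf, n_threads);
    ctx.work_data = malloc(ctx.work_size);   // work_size was derived from the per-node n_tasks
    ggml_graph_compute(&ctx, gf);
    free(ctx.work_data);
}
```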
Looks good, I only have a few minor nits:

- In llama.cpp, to avoid allocations in every eval, the work buffer memory could be stored as a `std::vector` in `llama_context`...
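A sketch of that first nit, reusing the hypothetical plan API from above; `work_buffer` is an invented member name standing in for whatever `llama_context` would actually use:

```cpp
#include <cstdint>
#include <vector>

struct llama_context_sketch {              // stand-in for the relevant part of llama_context
    std::vector<uint8_t> work_buffer;      // persists across evals, so no allocation per eval
};

static void eval_graph(llama_context_sketch & lctx, struct ggml_cgraph * gf, int n_threads) {
    struct ggml_cgraph_context ctx = ggml_graph_compute_plan(gf, n_threads);
    if (lctx.work_buffer.size() < ctx.work_size) {
        lctx.work_buffer.resize(ctx.work_size);   // grows only when a larger graph shows up
    }
    ctx.work_data = lctx.work_buffer.data();
    ggml_graph_compute(&ctx, gf);
}
```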
I think this looks good.

> A positive side-effect is that the user can now control the number of tasks for each op. This can be utilized also when creating custom...
The LoRA files are very simple currently, it's just a tiny header with a few parameters and a bunch of tensors. I think it should work fine with the way...
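For illustration only, the kind of layout that description implies: a small fixed header with the LoRA parameters, followed by the adapter tensors. The field names here are a guess, not the actual file format:

```cpp
#include <cstdint>

// Illustrative sketch of a "tiny header with a few parameters" for a LoRA file.
struct lora_file_header_sketch {
    uint32_t magic;     // file identifier
    uint32_t version;
    int32_t  r;         // LoRA rank
    int32_t  alpha;     // LoRA scaling parameter
};
// ...followed by the serialized loraA/loraB tensors for each adapted weight.
```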
Not sure if @ggerganov agrees, but I think that the best way to do this may be a simple macro that has all the variables for the 3 tensors src0/src1/dst,...
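Something along these lines, as a sketch; the macro names are invented here, but they expand to the ne/nb locals that ggml ops already declare by hand (`ne00`/`nb00` for src0, `ne10`/`nb10` for src1, `ne0`/`nb0` for dst):

```cpp
#include <stdint.h>
#include "ggml.h"

// Hypothetical helper: declare the four ne/nb locals for one tensor.
#define TENSOR_DIM_LOCALS(t, ne_pfx, nb_pfx)                                        \
    const int64_t ne_pfx##0 = (t)->ne[0]; const size_t nb_pfx##0 = (t)->nb[0];      \
    const int64_t ne_pfx##1 = (t)->ne[1]; const size_t nb_pfx##1 = (t)->nb[1];      \
    const int64_t ne_pfx##2 = (t)->ne[2]; const size_t nb_pfx##2 = (t)->nb[2];      \
    const int64_t ne_pfx##3 = (t)->ne[3]; const size_t nb_pfx##3 = (t)->nb[3];

// Hypothetical macro for a binary op: all the locals for src0, src1 and dst in one line,
// instead of repeating the same boilerplate at the top of every op implementation.
#define TENSOR_BINARY_OP_LOCALS(src0, src1, dst) \
    TENSOR_DIM_LOCALS(src0, ne0, nb0)            \
    TENSOR_DIM_LOCALS(src1, ne1, nb1)            \
    TENSOR_DIM_LOCALS(dst,  ne,  nb)
```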
I think there is some overlap between this and the plan to implement mixed CPU/GPU evaluation in llama.cpp by splitting the graph into multiple parts and running each of them...
The idea is not to make the splits automatically; the programmer will still need to choose where to make these splits, and the user will need to specify what backend...
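A rough sketch of what that could look like from the programmer's side; everything here (type names, backends, helper calls) is hypothetical and only meant to show that the cut points are chosen explicitly while the backend assignment comes from the user:

```cpp
// Hypothetical types: the programmer decides where the graph is cut,
// the user decides which backend runs each part.
enum backend_kind { BACKEND_CPU, BACKEND_CUDA };

struct graph_split {
    struct ggml_cgraph * graph;    // sub-graph for this part
    enum backend_kind    backend;  // selected by the user, e.g. via a command-line option
};

// Example shape of a manual split (helper functions are placeholders):
// struct graph_split splits[] = {
//     { build_input_part(ctx),  BACKEND_CPU  },   // embeddings / input processing
//     { build_layers_part(ctx), BACKEND_CUDA },   // repeating transformer layers
//     { build_output_part(ctx), BACKEND_CPU  },   // output head
// };
```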