candle issues

Codegemma-7b-instruct failure on Metal

4

`cargo run --features metal --example gemma -- --which code-7b-it --prompt "explain isakmpd's architecture"` fails with: ``` retrieved the files in 27.197292ms loaded the model in 36.859128625s explain isakmpd's architectureError: Metal...

niklasha

Refactor usages of "from_raw_parts" to use bytemuck utilities instead

Leaving as a draft to solicit feedback on the introduction of this package. During my research on WebGPU it seemed like this was a pretty common util being used in...

tomsanbear

Rigorously check the shapes of the two input tensors in the matmul function

213r
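
To illustrate the kind of check this title describes, here is a minimal sketch of shape validation for a batched matrix multiply. It is not candle's implementation: the function name `matmul_output_shape`, the error strings, and the no-broadcasting rule are all assumptions made for this example.

```rust
/// Hypothetical helper (not part of candle): validates two shapes for a
/// batched matrix multiply without broadcasting and returns the output shape.
fn matmul_output_shape(lhs: &[usize], rhs: &[usize]) -> Result<Vec<usize>, String> {
    // Both operands need at least two dimensions: (..., m, k) and (..., k, n).
    if lhs.len() < 2 || rhs.len() < 2 {
        return Err(format!("matmul needs rank >= 2, got {:?} and {:?}", lhs, rhs));
    }
    // Assumption for this sketch: ranks must match exactly (no broadcasting).
    if lhs.len() != rhs.len() {
        return Err(format!("rank mismatch: {:?} vs {:?}", lhs, rhs));
    }
    let batch = lhs.len() - 2;
    // Leading (batch) dimensions must be identical.
    if lhs[..batch] != rhs[..batch] {
        return Err(format!("batch dims differ: {:?} vs {:?}", lhs, rhs));
    }
    let (m, k_lhs) = (lhs[batch], lhs[batch + 1]);
    let (k_rhs, n) = (rhs[batch], rhs[batch + 1]);
    // The contracted dimension must agree on both sides.
    if k_lhs != k_rhs {
        return Err(format!("inner dims differ: {} vs {}", k_lhs, k_rhs));
    }
    let mut out = lhs[..batch].to_vec();
    out.push(m);
    out.push(n);
    Ok(out)
}
```

Calling `matmul_output_shape(&[2, 3, 4], &[2, 4, 5])` yields the shape `[2, 3, 5]`, while mismatched inner or batch dimensions produce an `Err` before any kernel would be launched.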

feat: Return Cow on CpuStorage

5

Return a Cow from to_cpu_storage to avoid unnecessary copy. Address one of the issues in https://github.com/huggingface/candle/issues/1699

junjunjd
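
The idea of returning a `Cow` to avoid a copy can be sketched with the standard library alone. The `Storage` enum and `to_host` method below are hypothetical stand-ins, not candle's actual types or signatures: data already in host memory is borrowed, and only data that lives elsewhere is cloned.

```rust
use std::borrow::Cow;

/// Hypothetical storage type for illustration only (not candle's `Storage`).
enum Storage {
    /// Data already resident in host memory.
    Host(Vec<f32>),
    /// Stand-in for device memory; a real backend would hold a device buffer.
    Device(Vec<f32>),
}

impl Storage {
    /// Returns host-side data, borrowing when no copy is required.
    fn to_host(&self) -> Cow<'_, [f32]> {
        match self {
            // Already on the host: hand out a borrow, no allocation.
            Storage::Host(data) => Cow::Borrowed(data.as_slice()),
            // Elsewhere: a copy is unavoidable, so return owned data.
            Storage::Device(data) => Cow::Owned(data.clone()),
        }
    }
}
```

Callers that only read the data can treat both cases uniformly through `Deref`, and those that need ownership can call `into_owned()`, which clones only in the borrowed case.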

Falcon implementation issues

2

It seems that clearing the cache in the current Falcon model implementation is not working properly: whenever a second query is run, the cache has not been cleared.

jorgeantonio21

Refactor candle-metal-kernels to accept an encoder instead of a command buffer

1

This PR is setting up the metal backend for the change proposed in this PR: https://github.com/huggingface/candle/pull/2037 The goal here is to have no actual change at runtime for this diff,...

tomsanbear

wait_until_completed is not working for metal device

3

During our benchmark testing, we noticed that the Candle backend for Burn was finishing up quickly for the Metal device. Upon closer inspection, we have discovered that `wait_until_completed` is not...

antimora

Unsound usages of unsafe function `slice::from_raw_parts`

2

Hi, we are researchers from [Sunlab](https://sunlab-gmu.github.io/). When we tried to scan Rust-based repositories with our own implemented bug detectors, we found that there are some potentially unsound usages of `slice::from_raw_parts`...

shinmao
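
As a general illustration of why reinterpreting bytes via `slice::from_raw_parts` can be unsound, and of one checked alternative, the sketch below uses the standard library's `slice::align_to`. The function name `bytes_as_u32s` is hypothetical and this is not the fix adopted in candle; it only shows the alignment and length conditions that an unchecked cast skips.

```rust
/// Hypothetical helper: views a byte slice as `u32`s only when the whole
/// slice is suitably aligned and its length is a multiple of four.
fn bytes_as_u32s(bytes: &[u8]) -> Option<&[u32]> {
    // SAFETY: every bit pattern is a valid `u32`, and `align_to` only places
    // correctly aligned, whole elements in the middle slice.
    let (prefix, words, suffix) = unsafe { bytes.align_to::<u32>() };
    // Any leftover bytes mean the input was misaligned or had a ragged length,
    // so the cast is refused instead of reading out of bounds.
    if prefix.is_empty() && suffix.is_empty() {
        Some(words)
    } else {
        None
    }
}
```

An unchecked `from_raw_parts(bytes.as_ptr() as *const u32, bytes.len() / 4)` would accept a misaligned pointer, which is undefined behavior; the checked version returns `None` in that case.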

Quantized much slower than llama.cpp with same model and settings...

22

quantized compiled using --> cargo build --example quantized -r --features metal Unsure of... how many layers accelerated / how many threads used / clearly different sample stages ..yet I presume...

oddpxl

How to run inference of a (very) large model across multiple GPUs?

4

The README mentions that candle supports multi-GPU inference, using NCCL under the hood. How can this be implemented? I wonder if there is any available example...

jorgeantonio21
