candle issues

Adapting llama_multiprocess to use `rsmpi`

6

Has anyone considered adapting `llama_multiprocess` to run on multiple machines instead of multiple processes? I've started by using the `SystemCommunicator` from the `rsmpi` library to replace `nccl::Comm`, but the debugging seems...

b0xtch

Retrieving softmax_lse in candle-flash-attn

1

Hello, in flash-attn the logsumexp of the softmax is not returned. It would be nice if it could be output too, as it is necessary to compute long context...

daguix
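For readers unfamiliar with the quantity being requested, here is a minimal plain-Rust sketch of the log-sum-exp of a row of attention scores. It does not use candle or candle-flash-attn; the function names `logsumexp`, `softmax` and `merge_lse` are hypothetical and chosen only for illustration. It also shows one general property of the value: the log-sum-exp of two separate blocks of scores can be combined into the log-sum-exp of the whole row.

```rust
/// Numerically stable log(sum(exp(x))) over a slice of scores.
fn logsumexp(scores: &[f32]) -> f32 {
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    if max == f32::NEG_INFINITY {
        // Empty input (or all scores are -inf): the sum of exponentials is 0.
        return f32::NEG_INFINITY;
    }
    // Subtracting the maximum keeps exp() from overflowing.
    let sum: f32 = scores.iter().map(|s| (s - max).exp()).sum();
    max + sum.ln()
}

/// Softmax written in terms of the log-sum-exp: p_i = exp(x_i - lse).
fn softmax(scores: &[f32]) -> Vec<f32> {
    let lse = logsumexp(scores);
    scores.iter().map(|s| (s - lse).exp()).collect()
}

/// Combine the log-sum-exp values of two disjoint blocks of scores.
fn merge_lse(lse_a: f32, lse_b: f32) -> f32 {
    logsumexp(&[lse_a, lse_b])
}

fn main() {
    let scores = [0.5_f32, -1.0, 2.0, 3.5];

    // Log-sum-exp over the full row versus over two halves merged afterwards.
    let full = logsumexp(&scores);
    let merged = merge_lse(logsumexp(&scores[..2]), logsumexp(&scores[2..]));
    assert!((full - merged).abs() < 1e-5);

    // The probabilities recovered from the log-sum-exp sum to one.
    let total: f32 = softmax(&scores).iter().sum();
    assert!((total - 1.0).abs() < 1e-5);

    println!("lse = {full}, merged = {merged}");
}
```

Subtracting the maximum before exponentiating is a standard stability trick; whether candle-flash-attn computes the value this way internally is not stated in the issue.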

load_image and load_image_and_resize return different permutations

In _candle-examples/src/lib.rs_ the function _load_image_ returns (channels, height, width), but the function _load_image_and_resize_ returns (channels, width, height): `let data = Tensor::from_vec(data, (height, width, 3), &Device::Cpu)?.permute((2, 0, 1))?;` and...

jeroenvlek
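To make the shape arithmetic in this report concrete, here is a small plain-Rust sketch that does not depend on candle. It assumes the usual permutation convention, where output axis `i` takes its size from input axis `perm[i]`. The helpers `permute_shape` and `hwc_to_chw` are hypothetical names, and the second assertion in `main` only illustrates how a (channels, width, height) result could arise; it is not taken from the candle source.

```rust
/// Shape after permuting a 3-D tensor: output axis `i` takes its size
/// from input axis `perm[i]`.
fn permute_shape(shape: [usize; 3], perm: [usize; 3]) -> [usize; 3] {
    [shape[perm[0]], shape[perm[1]], shape[perm[2]]]
}

/// Reorder an interleaved (height, width, channels) buffer into a planar
/// (channels, height, width) buffer.
fn hwc_to_chw(data: &[u8], height: usize, width: usize, channels: usize) -> Vec<u8> {
    assert_eq!(data.len(), height * width * channels);
    let mut out = vec![0u8; data.len()];
    for h in 0..height {
        for w in 0..width {
            for c in 0..channels {
                out[c * height * width + h * width + w] = data[(h * width + w) * channels + c];
            }
        }
    }
    out
}

fn main() {
    let (height, width) = (2, 4);

    // (height, width, 3) permuted with (2, 0, 1) gives (channels, height, width).
    assert_eq!(permute_shape([height, width, 3], [2, 0, 1]), [3, 2, 4]);

    // Assumption for illustration: if the tensor were built as (width, height, 3),
    // the same permutation would give (channels, width, height).
    assert_eq!(permute_shape([width, height, 3], [2, 0, 1]), [3, 4, 2]);

    // One row of two RGB pixels becomes three single-channel planes.
    let pixels = [1u8, 2, 3, 4, 5, 6];
    assert_eq!(hwc_to_chw(&pixels, 1, 2, 3), vec![1, 4, 2, 5, 3, 6]);
}
```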

Documentation fixes and improvements

I ended up on [the documentation page for Linear](https://docs.rs/candle-nn/latest/candle_nn/linear/struct.Linear.html) which surprisingly didn't have any documentation on what that layer does. Only after reading the code did I see that the...

iwanders

Positional encoding in DINOv2: Case without interpolation

Hi, I have been reviewing the DINOv2 Candle code and noticed what is most likely a bug (unless I misunderstood the code). As far as I understand, the function **interpolate_pos_encoding()** is used...

v-espitalier

Quantized-t5 models on Cuda

2

Hello! Are there any plans to implement quantized-t5 models on CUDA devices? I've been looking for a couple of days for a solution, or a way to implement CUDA support, for https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized-t5/main.rs...

helizac

Metal memory leak multiplying matrices

5

Running this code, which multiplies a 784x100 matrix by a 100x10 matrix, seems to leak memory. The memory usage gradually increases to more than 5 gigabytes when running with the metal...

ealmloff

How to select which GPU to use

3

We are working with the stable diffusion example. How do we select which GPU device on our system to use for the rendering? Thanks.

donkey-donkey

Mamba model is broken with `f16` precision

1

Running the command `cargo run --example mamba --release --features metal -- --prompt "Tell me a joke please" --dtype f16` does not work. The problem seems to lie in code: ...

jorgeantonio21

Slow YOLOv8 using MKL

I had great success using MKL for some models, especially BERT-likes, with huge improvements in speed (up to x25). Here the speedup is only about 1 second. However, I'm...

hugohmn
