Tensor::to_scalar very high latency
Bug description: When converting a CUDA tensor that was indexed from another large tensor into a CPU scalar, the conversion takes an unusually long time.
Minimal reproducible example:
use candle_core::{Device, IndexOp, Tensor};
use std::time::Instant;

fn time_it<T>(func: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = func();
    let duration = start.elapsed();
    println!("{:?}", duration);
    result
}

fn main() {
    let cuda_dev = Device::new_cuda(0).unwrap();
    let cpu_dev = Device::Cpu;
    let shape = [10000, 10000];
    let tensor = Tensor::rand(0f32, 1f32, &shape, &cpu_dev).unwrap();
    println!("Tensor on CPU: {:?}", tensor);
    let tensor = time_it(|| tensor.to_device(&cuda_dev).unwrap());
    println!("Tensor on GPU: {:?}", tensor.layout());
    let cell = time_it(|| tensor.i((42, 42)).unwrap());
    let scalar = time_it(|| cell.to_scalar::<f32>().unwrap());
    dbg!(scalar);
}
In the most recent crate version, v0.5.1, the line let scalar = time_it(|| cell.to_scalar::<f32>().unwrap()); takes roughly the same amount of time as the line let tensor = time_it(|| tensor.to_device(&cuda_dev).unwrap());.
In my investigation, the cause is that candle represents a slice as a shared storage plus a layout. This approach is effective most of the time, but it is inefficient when copying a small slice from CUDA memory to CPU memory, because the full storage (here 10000 × 10000 f32 ≈ 400 MB) is still copied to the host and the layout is only applied afterwards to offset into the slice you want. This creates massive latency (~300 ms before my attempted fix, ~17 µs after).
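For illustration, here is a minimal sketch of the idea behind the fix, using a plain Vec<f32> on the host as a stand-in for the device buffer; Layout and slice_to_cpu below are illustrative names, not candle's actual internals. When the layout covers a contiguous sub-range of the storage, only that range needs to cross the PCIe bus rather than the whole buffer.

// Sketch only: `Layout` and `slice_to_cpu` are illustrative stand-ins,
// not candle's real internals.

/// Minimal layout: an offset into the shared storage plus an element count.
struct Layout {
    start_offset: usize,
    elem_count: usize,
}

/// Copy only the elements the layout actually covers, instead of the full storage.
fn slice_to_cpu(device_storage: &[f32], layout: &Layout) -> Vec<f32> {
    // Before the fix: the whole `device_storage` (here 10000 * 10000 f32 ≈ 400 MB)
    // would be transferred and only then offset on the host.
    // After the fix: transfer just `elem_count` elements starting at `start_offset`.
    let start = layout.start_offset;
    let end = start + layout.elem_count;
    device_storage[start..end].to_vec() // stands in for a narrowed device-to-host memcpy
}

fn main() {
    // Stand-in for the 10000 x 10000 storage sitting on the GPU.
    let storage: Vec<f32> = (0..10_000 * 10_000).map(|i| i as f32).collect();
    // Layout produced by `tensor.i((42, 42))`: one element at offset 42 * 10000 + 42.
    let layout = Layout { start_offset: 42 * 10_000 + 42, elem_count: 1 };
    let host = slice_to_cpu(&storage, &layout);
    assert_eq!(host, vec![(42 * 10_000 + 42) as f32]);
}

Under this scheme the device-to-host copy for cell moves 4 bytes instead of ~400 MB, which is consistent with the drop from ~300 ms to ~17 µs observed above.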