Tensor::to_scalar very high latency
Bug description: When converting a CUDA tensor that was indexed from another large tensor into a CPU scalar, the conversion takes an unusually long time.
Minimal reproducible example:
use candle_core::{Device, IndexOp, Tensor};
use std::time::Instant;

fn time_it<T>(func: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = func();
    let duration = start.elapsed();
    println!("{:?}", duration);
    result
}

fn main() {
    let cuda_dev = Device::new_cuda(0).unwrap();
    let cpu_dev = Device::Cpu;
    let shape = [10000, 10000];
    let tensor = Tensor::rand(0f32, 1f32, &shape, &cpu_dev).unwrap();
    println!("Tensor on CPU: {:?}", tensor);
    let tensor = time_it(|| tensor.to_device(&cuda_dev).unwrap());
    println!("Tensor on GPU: {:?}", tensor.layout());
    let cell = time_it(|| tensor.i((42, 42)).unwrap());
    let scalar = time_it(|| cell.to_scalar::<f32>().unwrap());
    dbg!(scalar);
}
In the most recent crate version, v0.5.1, the line let scalar = time_it(|| cell.to_scalar::<f32>().unwrap()); takes roughly the same amount of time as the line let tensor = time_it(|| tensor.to_device(&cuda_dev).unwrap());.
In my investigation, the cause is that candle represents a slice as a shared storage plus a layout. This approach is effective most of the time, but it is inefficient when copying a small slice from CUDA memory to CPU memory, because the full storage (here 10000 × 10000 f32 ≈ 400 MB) is still copied to the host and the layout is only applied afterwards to offset into the slice you want. This creates massive latency (~300 ms before my attempted fix, ~17 µs after).
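For illustration, here is a minimal sketch of the idea behind the fix, using a plain Vec<f32> on the host as a stand-in for the device buffer; Layout and slice_to_cpu below are illustrative names, not candle's actual internals. When the layout covers a contiguous sub-range of the storage, only that range needs to cross the PCIe bus rather than the whole buffer.

// Sketch only: `Layout` and `slice_to_cpu` are illustrative stand-ins,
// not candle's real internals.

/// Minimal layout: an offset into the shared storage plus an element count.
struct Layout {
    start_offset: usize,
    elem_count: usize,
}

/// Copy only the elements the layout actually covers, instead of the full storage.
fn slice_to_cpu(device_storage: &[f32], layout: &Layout) -> Vec<f32> {
    // Before the fix: the whole `device_storage` (here 10000 * 10000 f32 ≈ 400 MB)
    // would be transferred and only then offset on the host.
    // After the fix: transfer just `elem_count` elements starting at `start_offset`.
    let start = layout.start_offset;
    let end = start + layout.elem_count;
    device_storage[start..end].to_vec() // stands in for a narrowed device-to-host memcpy
}

fn main() {
    // Stand-in for the 10000 x 10000 storage sitting on the GPU.
    let storage: Vec<f32> = (0..10_000 * 10_000).map(|i| i as f32).collect();
    // Layout produced by `tensor.i((42, 42))`: one element at offset 42 * 10000 + 42.
    let layout = Layout { start_offset: 42 * 10_000 + 42, elem_count: 1 };
    let host = slice_to_cpu(&storage, &layout);
    assert_eq!(host, vec![(42 * 10_000 + 42) as f32]);
}

Under this scheme the device-to-host copy for cell moves 4 bytes instead of ~400 MB, which is consistent with the drop from ~300 ms to ~17 µs observed above.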