Speed on `Device::Mps` is much slower than `Device::Cpu` on Apple M1
I wrote a test to see how fast a simple model can run on Apple M1 chips, using `tch`. The actual model, called `SanityModel`, is just a simple 4-layer feed-forward network. I have the following test to evaluate how much time 1000 iterations take.
```rust
use std::time::{Duration, Instant};
use tch::{nn, nn::ModuleT, Device, Kind};

#[test]
pub fn test_run() {
    let device = Device::Cpu; // switch to Device::Mps for the MPS run
    let vs = nn::VarStore::new(device);
    // SanityModel is defined elsewhere in the crate (4-layer feed-forward net).
    let model = SanityModel::new(vs.root());
    let start = Instant::now();
    for _ in 0..1000 {
        let _ = model.forward_t(
            &tch::Tensor::rand(&[1, 5, 10, 12], (Kind::Float, device)),
            false,
        );
    }
    let duration: Duration = start.elapsed();
    println!("Time for 1000 iterations {:?}", duration);
}
```
With `Device::Cpu`:

```
Time for 1000 iterations 415.461375ms
test models::sanity::test_run ... ok
```

With `Device::Mps`:

```
Time for 1000 iterations 467.140166ms
test models::sanity::test_run ... ok
```
The problems I have here are:
- there is no appreciable performance improvement from using MPS
- the MPS time is slightly worse, and I've seen it go above 700 ms
Environment
- OS: macOS Ventura 13.2.1
- Kernel: Darwin 22.3.0
- Arch: arm64
- cargo 1.67.0 (8ecd4f20a 2023-01-10)
- libtorch version 2.0.0 for M1 (arm64)
Could you try running the same thing in Python? I don't think there should be much difference between MPS and CPU in the way `tch` handles the computation, so I would expect similar behavior on the Python side.
Also note that this is a small computation, so there may well be some overhead that explains the absence of improvement. You may want to check this [PyTorch discussion] on the relatively poor performance of MPS on the Python side too.
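If it helps, here is a minimal sketch of how I would try to confirm that (only a guess at the methodology, not a definitive benchmark): it times the same tiny element-wise op on both devices, with a warm-up call and with each result copied back to the host so any asynchronous execution is included. If `Device::Mps` shows a roughly constant per-call cost even on such a small tensor, dispatch/transfer overhead rather than compute is what dominates the 1000-iteration test.

```rust
use std::time::Instant;
use tch::{Device, Kind, Tensor};

fn main() {
    for device in [Device::Cpu, Device::Mps] {
        let x = Tensor::rand(&[1, 5, 10, 12], (Kind::Float, device));
        // Warm-up: the first call on Mps pays one-off initialization costs.
        let _ = x.relu().to_device(Device::Cpu);

        let start = Instant::now();
        for _ in 0..1000 {
            // Copy the result back to the host so any asynchronous or lazy
            // execution is included in the measurement.
            let _ = x.relu().to_device(Device::Cpu);
        }
        println!("{device:?}: {:?} for 1000 tiny relu calls", start.elapsed());
    }
}
```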
Looks like you're right. Even the Python version is slower on MPS.
```
[cpu] Time for 1000 iterations 184.48 ms
[mps] Time for 1000 iterations 650.47 ms
```
If you don't mind, I'd like to try this test with CUDA to see the difference before closing this issue.
But also, I don't understand why the Python version is faster on CPU than the `tch` version, and not just by a little bit.
I came across this as well. It has to do with the overhead of copying data to the GPU and running it there. I was testing a small model (MNIST) and it ended up being faster on CPU and slower on GPU (M1). However, once I increased the size of my model, the CPU was much slower while the GPU time stayed about the same (presumably because the copy overhead was unchanged and the GPU had enough capacity to absorb the extra computation).
So I would suggest the CPU for small models and the GPU for bigger ones.
@zrthxn @antimora try increasing `batch_size` and inference on the GPU will be much faster than on the CPU.
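For what it's worth, a hedged sketch of that suggestion is below. The 4-layer network is a hypothetical stand-in for `SanityModel` (its source isn't in this thread) and the layer width and batch sizes are arbitrary; the idea is just to look for the batch size at which the fixed per-call cost is amortized and `Device::Mps` starts to win.

```rust
use std::time::Instant;
use tch::{nn, nn::ModuleT, Device, Kind, Tensor};

// Time 100 forward passes of a stand-in 4-layer feed-forward net at a given
// batch size, copying each result back to the host so lazy/async execution
// is counted in the measurement.
fn bench(device: Device, batch: i64) {
    let vs = nn::VarStore::new(device);
    let root = vs.root();
    // Hypothetical stand-in for SanityModel: 4 linear layers with ReLU.
    let net = nn::seq()
        .add(nn::linear(&root / "l1", 12, 256, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(&root / "l2", 256, 256, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(&root / "l3", 256, 256, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(&root / "l4", 256, 12, Default::default()));

    let input = Tensor::rand(&[batch, 5, 10, 12], (Kind::Float, device));
    let _ = net.forward_t(&input, false); // warm-up

    let start = Instant::now();
    for _ in 0..100 {
        let _ = net.forward_t(&input, false).to_device(Device::Cpu);
    }
    println!("{device:?} batch={batch}: {:?} per 100 iterations", start.elapsed());
}

fn main() {
    for device in [Device::Cpu, Device::Mps] {
        for batch in [1, 64, 1024] {
            bench(device, batch);
        }
    }
}
```

The exact crossover point will depend on the model and machine, but with a batch of 1 the per-call overhead alone can exceed the CPU's total time, which would be consistent with the numbers earlier in this thread.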
Yeah, I also ran into this problem when using `tch` to run inference with YOLOv8s. It seems that copying the data to the GPU takes too much time, but what confuses me is that I have already moved the model and the inputs to the device in advance. By the way, when I run the same inference in Python, the GPU is faster than the CPU.
Rust
```rust
pub fn predict(&self, image: &Tensor) -> Vec<Bbox> {
    // Time the forward pass on its own.
    let start_time = Instant::now();
    let pred = self.model.forward_t(image, false);
    let elapsed_time = start_time.elapsed();
    println!("YOLOv8 inference time: {} ms", elapsed_time.as_millis());

    // Copy the predictions back to the CPU before post-processing.
    let pred = pred.to_device(tch::Device::Cpu);

    // Time non-maximum suppression separately.
    let start_time = Instant::now();
    let result = self.non_max_suppression(&pred);
    let elapsed_time = start_time.elapsed();
    println!("YOLOv8 nms time: {} ms", elapsed_time.as_millis());
    result
}

// and main.rs
let device = tch::Device::cuda_if_available(); // compare to tch::Device::Cpu
println!("Run inference by device={:?}", device);
let mut yolov8 = yolo::YOLO::new(weights, 640, 640, 0.25, 0.65, 100, device);
let img = yolo::YOLO::preprocess(&mut yolov8, &img_path).to_device(yolov8.device);
let results = yolov8.predict(&img);
```
GPU result and CPU result (screenshots): there is almost a six-fold difference in inference time between the two devices.
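One thing that might be worth ruling out when timing the CUDA path: kernel launches are asynchronous, so stopping the clock right after `forward_t` while leaving the `to_device(Cpu)` copy outside the timed region can make the GPU and CPU numbers hard to compare. Below is a minimal sketch (the model type is left generic, and everything else from the snippet above is omitted) that times the forward pass together with the copy back to the host:

```rust
use std::time::Instant;
use tch::{nn::ModuleT, Device, Tensor};

fn timed_forward<M: ModuleT>(model: &M, image: &Tensor) -> Tensor {
    let start = Instant::now();
    let pred = model.forward_t(image, false);
    // The copy back to the host is a synchronization point: it waits for the
    // GPU to finish, so the printed time covers the whole forward pass.
    let pred = pred.to_device(Device::Cpu);
    println!("YOLOv8 forward + copy-back: {} ms", start.elapsed().as_millis());
    pred
}
```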
Python
```python
model = YOLO('./yolov8s.torchscript', task='detect')
res = model.predict('bus.jpg', device='0')  # device='cpu' for the CPU run
```
GPU result and CPU result: (screenshots omitted)