
Speed on `Device::Mps` is much slower than `Device::Cpu` on Apple M1

zrthxn opened this issue 1 year ago • 6 comments

I wrote a test to see how fast a simple model can run on Apple M1 chips using tch. The actual model, called SanityModel, is just a simple four-layer feed-forward network. I have the following test to measure how long 1000 iterations take.

use std::time::{Duration, Instant};
use tch::{nn, nn::ModuleT, Device, Kind, Tensor};

#[test]
pub fn test_run() {
    // Switch to Device::Mps to benchmark the GPU path.
    let device = Device::Cpu;
    let vs = nn::VarStore::new(device);
    let model = SanityModel::new(vs.root());

    let start = Instant::now();
    for _ in 0..1000 {
        let _ = model.forward_t(
            &Tensor::rand(&[1, 5, 10, 12], (Kind::Float, device)),
            false,
        );
    }
    let duration: Duration = start.elapsed();
    println!("Time for 1000 iterations {:?}", duration);
}
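
(SanityModel itself isn't included here; it's nothing special. A minimal sketch of the kind of four-layer feed-forward network I mean, with purely illustrative layer sizes, could look like this in tch.)

use tch::{nn, nn::ModuleT, Tensor};

// Hypothetical stand-in for SanityModel: four Linear layers with ReLU
// activations. The 12/64/10 sizes below are illustrative only.
#[derive(Debug)]
pub struct SanityModel {
    seq: nn::Sequential,
}

impl SanityModel {
    pub fn new(vs: nn::Path) -> Self {
        let seq = nn::seq()
            .add(nn::linear(&vs / "l1", 12, 64, Default::default()))
            .add_fn(|xs| xs.relu())
            .add(nn::linear(&vs / "l2", 64, 64, Default::default()))
            .add_fn(|xs| xs.relu())
            .add(nn::linear(&vs / "l3", 64, 64, Default::default()))
            .add_fn(|xs| xs.relu())
            .add(nn::linear(&vs / "l4", 64, 10, Default::default()));
        SanityModel { seq }
    }
}

impl ModuleT for SanityModel {
    fn forward_t(&self, xs: &Tensor, train: bool) -> Tensor {
        self.seq.forward_t(xs, train)
    }
}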

With Device::Cpu

Time for 1000 iterations 415.461375ms
test models::sanity::test_run ... ok

With Device::Mps

Time for 1000 iterations 467.140166ms
test models::sanity::test_run ... ok

The problems I have here are:

  • there is no appreciable performance improvement from using MPS
  • the time is actually slightly worse; I've seen it go above 700 ms.

Environment

  • OS: macOS Ventura 13.2.1
  • Kernel: Darwin 22.3.0
  • Arch: arm64
  • cargo 1.67.0 (8ecd4f20a 2023-01-10)
  • libtorch version 2.0.0 for M1 (arm64)

zrthxn avatar Apr 24 '23 17:04 zrthxn

Could you try running the same thing in Python? I don't think there should be much difference between MPS and CPU in the way tch handles the computation, so I would expect similar behavior on the Python side. Also note that this is a small computation, so there may well be some overhead that explains the absence of improvement. You may want to check this [PyTorch discussion] on the relatively poor performance of MPS on the Python side too.
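
Something along these lines should be close enough on the Python side (a rough sketch only; the layer sizes are guesses since I don't know the exact SanityModel):

import time
import torch

def bench(device: str, iters: int = 1000) -> float:
    # Hypothetical stand-in for SanityModel: a small four-layer MLP.
    model = torch.nn.Sequential(
        torch.nn.Linear(12, 64), torch.nn.ReLU(),
        torch.nn.Linear(64, 64), torch.nn.ReLU(),
        torch.nn.Linear(64, 64), torch.nn.ReLU(),
        torch.nn.Linear(64, 10),
    ).to(device)
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            model(torch.rand(1, 5, 10, 12, device=device))
    return (time.perf_counter() - start) * 1000

for device in ("cpu", "mps"):
    print(f"[{device}] Time for 1000 iterations {bench(device):.2f} ms")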

LaurentMazare avatar Apr 24 '23 17:04 LaurentMazare

Looks like you're right. Even the Python version is slower on MPS.

[cpu] Time for 1000 iterations 184.48 ms
[mps] Time for 1000 iterations 650.47 ms

If you don't mind, I'd like to try this test with CUDA to see the difference before closing this issue.

zrthxn avatar Apr 24 '23 17:04 zrthxn

But also, I don't understand why the Python version is faster on CPU than the tch version. And not just by a little bit.
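
One thing I still have to rule out (this is just a guess on my part): cargo test builds without optimizations by default, so the tch side may have been a debug build, and the intra-op thread count may differ from what PyTorch uses. Something like this should make the CPU comparison fairer, assuming I remember right that tch exposes set_num_threads:

// Build and run the benchmark with optimizations:
//     cargo test --release -- --nocapture
// and pin the intra-op thread count so both sides use the same setting.
tch::set_num_threads(4); // 4 is an arbitrary value; mirror torch.set_num_threads(4) in the Python script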

zrthxn avatar Apr 24 '23 17:04 zrthxn

I came across this as well. It has to do with the overhead of copying data to the GPU and running it there. I was testing a small model (MNIST) and it ended up being fast on CPU but slower on GPU (M1). However, once I increased the size of my model, the CPU was much slower while the GPU stayed about the same (presumably the overhead was unchanged and the GPU has enough spare capacity to absorb the extra computation).

So I would suggest comparing CPU and GPU again with a bigger model.

antimora avatar Apr 25 '23 01:04 antimora

@zrthxn @antimora try increasing the batch_size and inference on the GPU will be much faster than on the CPU.
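
For example (just a sketch; the batch size of 256 is arbitrary and Device::Mps is assumed), the test above could be changed to feed one larger batch per call:

use std::time::Instant;
use tch::{nn, nn::ModuleT, Device, Kind, Tensor};

#[test]
pub fn test_run_batched() {
    let device = Device::Mps;
    let vs = nn::VarStore::new(device);
    let model = SanityModel::new(vs.root());

    // One tensor with a larger leading batch dimension, created once outside
    // the loop, so each forward pass amortizes the per-dispatch overhead.
    let input = Tensor::rand(&[256, 5, 10, 12], (Kind::Float, device));

    let start = Instant::now();
    for _ in 0..1000 {
        let _ = model.forward_t(&input, false);
    }
    println!("Time for 1000 batched iterations {:?}", start.elapsed());
}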

igor-yusupov avatar Apr 25 '23 08:04 igor-yusupov

Yeah, I also ran into this problem when using tch to run inference with YOLOv8s. It seems that copying the data to the GPU takes too much time, but what confuses me is that I have already moved the model and the inputs to the device in advance. By the way, when I run the same inference from Python, the GPU is faster than the CPU.

Rust

pub fn predict(&self, image: &Tensor) -> Vec<Bbox> {
    // Time the forward pass on the GPU.
    let start_time = Instant::now();
    let pred = self.model.forward_t(image, false);
    let elapsed_time = start_time.elapsed();
    println!("YOLOv8 inference time: {} ms", elapsed_time.as_millis());

    // Move the predictions back to the CPU and time the NMS step.
    let pred = pred.to_device(tch::Device::Cpu);
    let start_time = Instant::now();
    let result = self.non_max_suppression(&pred);
    let elapsed_time = start_time.elapsed();
    println!("YOLOv8 nms time: {} ms", elapsed_time.as_millis());
    result
}

// and main.rs
let device = tch::Device::cuda_if_available(); // compare with tch::Device::Cpu
println!("Run inference by device={:?}", device);
let mut yolov8 = yolo::YOLO::new(weights, 640, 640, 0.25, 0.65, 100, device);
let img = yolo::YOLO::preprocess(&mut yolov8, &img_path).to_device(yolov8.device);
let results = yolov8.predict(&img);
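
(Something I'm not sure about myself, so treat it as an assumption: CUDA kernel launches are asynchronous, and the first pass on a fresh device pays one-off initialization costs, so the time measured right after forward_t may not reflect the real GPU work. A sketch of a more defensive measurement, which warms up once and forces the result back to the CPU inside the timed region:)

use std::time::Instant;
use tch::{Device, Tensor};

// `model` and `image` are assumed to already live on the GPU, as in the
// snippet above.
fn timed_forward(model: &impl tch::nn::ModuleT, image: &Tensor) -> Tensor {
    // Warm-up pass: absorbs one-off CUDA/cuDNN initialization costs.
    let _ = model.forward_t(image, false).to_device(Device::Cpu);

    let start = Instant::now();
    let pred = model.forward_t(image, false);
    // Moving the output to the CPU forces the asynchronous GPU work to
    // finish before the timer is stopped.
    let pred = pred.to_device(Device::Cpu);
    println!("YOLOv8 inference time: {} ms", start.elapsed().as_millis());
    pred
}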

GPU result: [screenshot]

CPU result: [screenshot]

There is almost a six-fold difference in inference time between the two devices.

Python

model = YOLO('./yolov8s.torchscript', task='detect')
res = model.predict('bus.jpg', device='0')  # device='cpu'

GPU result: [screenshot]

CPU result: [screenshot]

Yshelgi avatar Jul 01 '23 23:07 Yshelgi