Results 340 comments of Philip Turner

> but sadly I didn't observe any power consumption by the ANE module, which is too bad.

The neural engine can't be used for training anyway. It only supports Float16,...

Per https://github.com/pytorch/pytorch/issues/77753#issuecomment-1132230314, the GPU may be slower because it spends more of its time writing to new memory than actually executing. Also, since it continues writing to new memory on...
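A minimal Metal sketch of that effect (not PyTorch's actual allocator): it times a loop that writes into a freshly allocated buffer each iteration against one that reuses a single warm buffer. The buffer size and iteration count are arbitrary placeholders.

```swift
import Foundation
import Metal
import QuartzCore

// Sketch: compare writing to freshly allocated memory each iteration
// against writing to one reused buffer. A large gap means the loop is
// dominated by allocation and page faults, not by the writes themselves.
let device = MTLCreateSystemDefaultDevice()!
let length = 4 * 1024 * 1024 // 4 MB per buffer (arbitrary)
let iterations = 100

// Case 1: allocate a new shared buffer every iteration, then write to it.
var start = CACurrentMediaTime()
for _ in 0..<iterations {
    let fresh = device.makeBuffer(length: length, options: .storageModeShared)!
    memset(fresh.contents(), 1, length)
}
let freshTime = CACurrentMediaTime() - start

// Case 2: write to the same preallocated buffer every iteration.
let reused = device.makeBuffer(length: length, options: .storageModeShared)!
start = CACurrentMediaTime()
for _ in 0..<iterations {
    memset(reused.contents(), 1, length)
}
let reusedTime = CACurrentMediaTime() - start

print("fresh: \(freshTime) s, reused: \(reusedTime) s")
```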

I do have a workaround planned for my own framework ([s4tf/s4tf](https://github.com/s4tf/s4tf)) where I will bypass this restriction. It's described in https://github.com/AnarchoSystems/DeepSwift/issues/1#issuecomment-1129891796, although please don't comment on that thread. @albanD does...

I wonder if we could convert this to actual MPS code in Swift, then profile how long that code takes. That would determine whether the bottleneck is PyTorch’s fault and...
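As a sketch of what that experiment might look like, assuming MPSGraph (which the PyTorch MPS backend builds on) as the Swift entry point; the 512x512 matmul, zero-filled inputs, and iteration count are placeholders:

```swift
import Metal
import MetalPerformanceShadersGraph
import QuartzCore

// Sketch: run one matmul through MPSGraph directly and time it, to
// compare against the same op dispatched from PyTorch's MPS backend.
let mtlDevice = MTLCreateSystemDefaultDevice()!
let device = MPSGraphDevice(mtlDevice: mtlDevice)
let graph = MPSGraph()

let shape: [NSNumber] = [512, 512]
let a = graph.placeholder(shape: shape, dataType: .float32, name: "a")
let b = graph.placeholder(shape: shape, dataType: .float32, name: "b")
let c = graph.matrixMultiplication(primary: a, secondary: b, name: "c")

// Zero-filled inputs; the contents don't matter for timing.
let bytes = Data(count: 512 * 512 * MemoryLayout<Float>.stride)
let aData = MPSGraphTensorData(device: device, data: bytes, shape: shape, dataType: .float32)
let bData = MPSGraphTensorData(device: device, data: bytes, shape: shape, dataType: .float32)

// Warm up once so one-time graph compilation isn't counted.
_ = graph.run(feeds: [a: aData, b: bData], targetTensors: [c], targetOperations: nil)

let iterations = 100
let start = CACurrentMediaTime()
for _ in 0..<iterations {
    _ = graph.run(feeds: [a: aData, b: bData], targetTensors: [c], targetOperations: nil)
}
let avg = (CACurrentMediaTime() - start) / Double(iterations)
print("avg per matmul: \(avg * 1e6) µs")
```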

This might be why it’s so slow. Worth reading if you have the time. https://discuss.pytorch.org/t/sequential-throughput-of-gpu-execution/156303
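For context, a minimal sketch of the measurement that thread is about: committing one command buffer at a time and blocking on each, the way an eagerly executing framework effectively does. The loop count is arbitrary; the round-trip latency this measures is what caps sequential throughput.

```swift
import Metal
import QuartzCore

// Sketch: measure how many commit-and-wait round trips the driver can
// sustain per second. Each dependent op in an eager framework pays
// roughly this latency, regardless of how small the kernel is.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let iterations = 1000
let start = CACurrentMediaTime()
for _ in 0..<iterations {
    let commandBuffer = queue.makeCommandBuffer()!
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
let elapsed = CACurrentMediaTime() - start
print("round trips per second: \(Double(iterations) / elapsed)")
```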

The spike in microsecond-level overhead (CPU time avg) was discussed [here](https://github.com/pytorch/pytorch/issues/82707#issuecomment-1204672455). I think I’ve found a solution to it, but haven’t put it into practice with an RNN.

> Any recent plan to implement it?

I'm not planning to implement it in PyTorch; however, it's open source and I've explained it in great depth. Someone else could look...

A working RNN implementation probably won't happen within the next few weeks. I'm juggling a bunch of other projects simultaneously, so things will happen slowly regarding...

That strongly suggests it's a driver-overhead bottleneck: the CPU takes 1/4 as long when you go from 1000 to 250, i.e. time scales linearly with the number of operations, which is the signature of fixed per-op overhead. Perhaps it should take 1/16 ((1/4)^2) as long in...

Based on my experiments, driver overhead can be reduced significantly relative to a naive implementation. In the best case, there is a 100x reduction. In the average case, there is a...
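The comment doesn't spell out the technique here, but one standard way to get reductions of that order is batching: encoding many commands into a single command buffer instead of paying a commit-and-wait round trip per command. A minimal sketch under that assumption, using a trivial blit fill as a stand-in for a real kernel and arbitrary sizes:

```swift
import Metal
import QuartzCore

// Sketch: one command buffer per command (naive) versus all commands
// batched into a single command buffer. The GPU work is identical; only
// the number of driver round trips changes.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let buffer = device.makeBuffer(length: 4096, options: .storageModePrivate)!
let commands = 1000

// Naive: one command buffer (and one round trip) per command.
var start = CACurrentMediaTime()
for _ in 0..<commands {
    let cmdbuf = queue.makeCommandBuffer()!
    let blit = cmdbuf.makeBlitCommandEncoder()!
    blit.fill(buffer: buffer, range: 0..<4096, value: 0)
    blit.endEncoding()
    cmdbuf.commit()
    cmdbuf.waitUntilCompleted()
}
let naive = CACurrentMediaTime() - start

// Batched: all commands in one command buffer, one round trip total.
start = CACurrentMediaTime()
let cmdbuf = queue.makeCommandBuffer()!
let blit = cmdbuf.makeBlitCommandEncoder()!
for _ in 0..<commands {
    blit.fill(buffer: buffer, range: 0..<4096, value: 0)
}
blit.endEncoding()
cmdbuf.commit()
cmdbuf.waitUntilCompleted()
let batched = CACurrentMediaTime() - start

print("naive: \(naive) s, batched: \(batched) s, speedup: \(naive / batched)x")
```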