Native ARM Mac GPU usage (metal performance shaders)

Open tylerjereddy opened this issue 2 years ago • 7 comments

Since Cirrus CI offers some native ARM Mac (M-series chip) services, I was wondering whether there is any documentation, examples, or options for using the GPU component (i.e., the Metal Performance Shaders) when testing with, e.g., torch, which has an MPS backend: https://pytorch.org/docs/stable/notes/mps.html

I did a little experiment here: https://github.com/tylerjereddy/scipy/pull/71

And found that there may be some restrictions that prevent practical usage in the open source tier:

RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 1.70 GB). Tried to allocate 0 bytes on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
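For anyone who wants to reproduce this on their own runner, here is a minimal sketch of a probe, not the exact code from the PR linked above. It relies only on `torch.backends.mps.is_built()` and `torch.backends.mps.is_available()` from the PyTorch MPS notes; the function name `probe_mps` and the status strings are made up for illustration, and the small allocation is an assumption about what is enough to surface the error.

```python
def probe_mps():
    """Return a short status string describing whether PyTorch can use MPS here.

    Hypothetical helper: the name and the status strings are illustrative only.
    """
    try:
        import torch
    except ImportError:
        return "torch-not-installed"

    # Older PyTorch releases have no torch.backends.mps module at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is None or not mps.is_built():
        return "mps-not-built"
    if not mps.is_available():
        return "mps-not-available"

    try:
        # Assumption: a small allocation is enough to exercise the MPS allocator.
        x = torch.ones(8, device="mps")
        return "mps-ok" if float(x.sum()) == 8.0 else "mps-wrong-result"
    except RuntimeError as exc:
        # The out-of-memory failure quoted above is raised as a RuntimeError.
        return f"mps-error: {exc}"


if __name__ == "__main__":
    print(probe_mps())
```

Running this as a CI step prints a single line, which makes it easy to compare a VM against a bare-metal M-series machine.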

Do you have any experience/guidance here? Is this expected? Is this disabled and you don't want us trying it? It would be very cool to be able to flush through GPUs in CI like that!

tylerjereddy avatar May 16 '23 17:05 tylerjereddy

I ran the following PyTorch example inside a Tart VM, and indeed MPS does not seem to be supported by the underlying Virtualization.framework yet. Hopefully there will be some news at WWDC in two weeks. 🤞

fkorotkov avatar May 16 '23 19:05 fkorotkov

Thanks, this would be pretty cool!

tylerjereddy avatar May 16 '23 19:05 tylerjereddy

After a little more investigation, it seems that Virtualization.framework should support Metal. It's mentioned in last year's WWDC video at 10:53. There is even ParavirtualizedGraphics.framework, which predates Virtualization.framework; the latter allegedly uses it.

But in my testing I don't see any graphics devices inside the VM:

Screenshot 2023-05-16 at 4 14 33 PM

Comparing to what I see on an M1 Mac Mini:

Screenshot 2023-05-16 at 4 12 36 PM

@edigaryev I know you dug into the private APIs of Virtualization.framework. Have you seen any mention of Metal?

fkorotkov avatar May 16 '23 20:05 fkorotkov

@fkorotkov the paravirtualization actually seems to be used:

Screenshot 2023-05-17 at 12 56 33

You can also check this by running ioreg inside of a VM:

% ioreg -n AppleParavirtGPU -r
+-o AppleParavirtGPU  <class AppleParavirtGPU, id 0x100000191, registered, matched, active, busy 0 (1 ms), retain 13>
  | {
  |   "IOClass" = "AppleParavirtGPU"
  |   "KDebugVersion" = 4294967296
  |   "IOPersonalityPublisher" = "com.apple.driver.AppleParavirtGPUIOGPUFamily"
  |   "IOMatchedAtBoot" = Yes
  |   "IOReportLegendPublic" = Yes
  |   "AGCInfo" = {"fLastSubmissionPID"=134,"fSubmissionsSinceLastCheck"=0,"fBusyCount"=0}
  |   "IOProviderClass" = "AppleARMIODevice"
  |   "MetalPluginName" = "AppleParavirtGPUMetalIOGPUFamily"
  |   "IOProbeScore" = 0
  |   "SurfaceList" = ()
  |   "IONameMatch" = "paravirtualizedgraphics,gpu"
  |   "MetalPluginClassName" = "AppleParavirtDevice"
  |   "SchedulerState" = {"Stamps"=(),"BusyWorkQueues"=()}
  |   "CFBundleIdentifierKernel" = "com.apple.driver.AppleParavirtGPUIOGPUFamily"
  |   "IOMatchCategory" = "IOAcceleratorES"
  |   "CFBundleIdentifier" = "com.apple.driver.AppleParavirtGPUIOGPUFamily"
  |   "IONameMatched" = "paravirtualizedgraphics,gpu"
  |   "PerformanceStatistics" = {"recoveryCount"=0,"In use system memory"=108962304,"Alloc system memory"=52527104}
  |   "IOGeneralInterest" = "IOCommand is not serializable"
  |   "IOReportLegend" = ({"IOReportChannels"=((1,6442450945,"Alloc system memory"),(2,6442450945,"In use system memory"),(3,6442450945,"GPU Restart Count")),"IOReportGroupName"="Internal Statistics","IOReportChan$
  |   "DisplayPortCount" = 1
  | }
  | 
  +-o AppleParavirtDisplay  <class AppleParavirtDisplay, id 0x1000001df, registered, matched, active, busy 0 (0 ms), retain 9>
  | +-o IOMobileFramebufferUserClient  <class IOMobileFramebufferUserClient, id 0x100000285, !registered, !matched, active, busy 0, retain 5>
  | +-o IOMobileFramebufferUserClient  <class IOMobileFramebufferUserClient, id 0x100000286, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x100000294, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x100000353, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x10000035a, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x10000035d, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x10000036a, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x1000003fa, !registered, !matched, active, busy 0, retain 5>
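The same check could be scripted, for example as a CI step. Below is a rough sketch that parses `ioreg` text for the `MetalPluginName` key shown in the output above. The helper names `parse_ioreg_properties` and `paravirt_gpu_metal_plugin` are hypothetical, and the parser only handles simple one-line `"key" = value` entries.

```python
import re
import shutil
import subprocess


def parse_ioreg_properties(text):
    """Collect one-line '"key" = value' entries from ioreg output into a dict.

    Values are kept as strings; surrounding double quotes are stripped.
    Only the first occurrence of each key is kept.
    """
    props = {}
    for line in text.splitlines():
        # Skip the tree-drawing prefix (spaces and '|'), then match "key" = value.
        m = re.match(r'^[\s|]*"([^"]+)"\s*=\s*(.+?)\s*$', line)
        if m:
            props.setdefault(m.group(1), m.group(2).strip('"'))
    return props


def paravirt_gpu_metal_plugin():
    """Return the MetalPluginName of AppleParavirtGPU, or None if not found."""
    if shutil.which("ioreg") is None:
        return None  # not on macOS, or ioreg is not on PATH
    result = subprocess.run(
        ["ioreg", "-n", "AppleParavirtGPU", "-r"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_ioreg_properties(result.stdout).get("MetalPluginName")


if __name__ == "__main__":
    print(paravirt_gpu_metal_plugin())
```

A non-`None` result only shows that the paravirtualized GPU driver is registered inside the VM; it says nothing about whether MPS workloads will actually run.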

I'm not sure as to why Apple’s Metal Performance Shaders don't work, though.

edigaryev avatar May 17 '23 09:05 edigaryev

Perhaps @Developer-Ecosystem-Engineering might be able to (informally) point us in the right direction? I know they've been quite helpful with NumPy low-level development on M-series chips.

tylerjereddy avatar May 31 '23 13:05 tylerjereddy

I am running into the same issue as well.

gluefox avatar May 08 '24 10:05 gluefox

It's currently not supported to run these types of workloads under Virtualization.framework.

We understand the request!