Feature request: TPU v4 support
With TPU v4, Google has really cleaned up the user experience around TPU VMs. Does the google-cls-v2 provider allow provisioning of TPU v4 machines? If so, could you please share an example that illustrates provisioning, and loading any drivers needed to make the TPU v4 accelerator types visible to jobs submitted via dsub?
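Something along these lines would be ideal. This is a purely hypothetical sketch that reuses dsub's existing GPU-style flags; I don't know what accelerator-type string (if any) the Life Sciences API would accept for v4, so treat those values as placeholders:

```
# Hypothetical dsub invocation; the tpu-v4 accelerator type and other values are placeholders.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --zones "us-central2-*" \
  --accelerator-type tpu-v4 \
  --accelerator-count 1 \
  --image my-tpu-training-image \
  --logging gs://my-bucket/logs/ \
  --command 'python3 train.py' \
  --wait
```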
Thanks @rivershah! This is something I'll be looking into more, but from a first pass over the TPU documentation it sounds like something we should be able to support, though some Life Sciences API changes may be needed.
Would you mind clarifying what you mean when you say Google "cleaned up" the experience? I'd be interested to hear about your experience. Is there any specific documentation you follow for your work with TPUs?
@wnojopra Please take a look here: https://www.youtube.com/watch?v=W7A-9MYvPwI&t=301s
TPUs now follow the same provisioning model as GPUs: you get root access to a host VM with the accelerators attached to that host. I am not sure how relevant this provisioning refactor is for the Life Sciences API, but it looks like a unification with the existing GPU provisioning model, which already works very nicely with dsub.
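For context, the new flow outside of dsub is roughly the following (a sketch only; the accelerator type and runtime version strings are illustrative, not a recommendation):

```
# Create a TPU v4 host VM directly; accelerator type / runtime version shown are illustrative.
gcloud compute tpus tpu-vm create my-tpu-v4 \
  --zone=us-central2-b \
  --accelerator-type=v4-8 \
  --version=tpu-vm-v4-base

# SSH straight into the host VM that has the accelerators attached (root access, no GRPC proxy).
gcloud compute tpus tpu-vm ssh my-tpu-v4 --zone=us-central2-b
```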
Thanks for sharing. Wanted to highlight an important bit from the video for others to read:
""" In the past, for using TPUs on Cloud, a network attached architecture was used. The user would connect to a VM and then interact with the TPUs through GRPC calls. This was difficult to debug and sometimes introduced delays in the experience. With all new TPU VM Architecture, you have root access to every TPU VM you create. So you can install and run any software you wish in a tight loop with your TPU accelerators. You can use local storage, execute custom code in your input pipelines, and more easily integrate Cloud TPUs into your research and production workflows. """