
[FEATURE] TPU training / validation support and train / val code refactor.

Open · rwightman opened this issue on Feb 25 '21 · 6 comments

The goal is to adapt the timm training and validation scripts to work well with PyTorch XLA on TPUs / TPU Pods (and possibly XLA on CPU/GPU), as well as with plain PyTorch on GPU.

As part of this I will also refactor the training / validation loop, step fn, cuda/xla/distributed abstractions.

The aim is to keep it lean and fairly flat while breaking it down into some reusable/swappable components. There will be a bit more structure, but only a step in the direction of what fastai, lightning, ignite, etc. offer.
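For illustration, a minimal sketch of the kind of cuda-vs-xla abstraction being described, assuming `torch_xla` may or may not be installed; the helper names here are made up for this example, not the eventual timm API:

```python
import torch

try:
    import torch_xla.core.xla_model as xm  # only present when PyTorch XLA is installed
    HAS_XLA = True
except ImportError:
    HAS_XLA = False


def resolve_device(prefer_xla: bool = True) -> torch.device:
    """Pick an XLA device when available, otherwise fall back to CUDA, then CPU."""
    if prefer_xla and HAS_XLA:
        return xm.xla_device()
    if torch.cuda.is_available():
        return torch.device('cuda')
    return torch.device('cpu')


def step_optimizer(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    """Hide the XLA-specific optimizer step behind one call site."""
    if HAS_XLA and device.type == 'xla':
        # xm.optimizer_step all-reduces gradients across TPU cores before stepping
        xm.optimizer_step(optimizer)
    else:
        optimizer.step()
```

The idea is that the train/validate step functions only ever call `resolve_device()` and `step_optimizer()`, so the cuda-vs-xla branching stays out of the loop itself.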

rwightman commented Feb 25 '21

@rwightman If you want, I am willing to help out with any efforts regarding adapting code to PyTorch XLA, having used PyTorch XLA many times in the past...

tmabraham commented Feb 28 '21

@tmabraham thanks, I might take you up on that. I'm currently thinking through the abstractions, trying to hide most of the cuda + distributed config vs xla + distributed config without adding too many levels of abstraction.

My biggest concern is with the current 2 VM TPU architecture: how to feed the TPU VMs with data without needing to spin up one n1-96 node per 8 TPU cores, as seems to be suggested to prevent the data preprocessing from holding back the TPU. Have you spent any time optimizing the infeed or tinkering with balancing pre-processing between the GCE feeding VM and the TPU host VM?
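For context, the usual PyTorch XLA infeed path wraps the host-side DataLoader so the transfer to the TPU core overlaps with CPU preprocessing. A rough sketch, assuming `torch_xla` is installed (the batch size and worker count are placeholder values):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl


def make_infeed(dataset, batch_size=256, num_workers=8):
    """Wrap a host-side DataLoader so batches are staged onto the TPU core in the background."""
    device = xm.xla_device()
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,  # these CPU workers are what the 2 VM setup can starve
        drop_last=True,
    )
    # MpDeviceLoader asynchronously moves each batch to the XLA device
    return pl.MpDeviceLoader(loader, device)
```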

rwightman commented Feb 28 '21

Unfortunately, I haven't spent too much time with optimizations regarding the passing of data from the VM to the TPU.

> to prevent the data preprocessing from holding back the TPU

I would suggest that the data preprocessing be done ahead of time, so it's the preprocessed data that gets loaded onto the TPU. For example, I had an NLP dataset on which I trained an XLM-RoBERTa-large using PyTorch XLA; I did the tokenization up front and saved it as NumPy files, which I then loaded as out-of-memory datasets.
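A minimal sketch of that preprocess-once, load-out-of-memory approach using memory-mapped NumPy files (the file names and dataset class here are hypothetical, not from the actual project):

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class MmapTokenDataset(Dataset):
    """Reads pre-tokenized samples from .npy files without pulling them all into RAM."""

    def __init__(self, ids_path="input_ids.npy", labels_path="labels.npy"):
        # mmap_mode='r' keeps the arrays on disk; pages are faulted in on access
        self.input_ids = np.load(ids_path, mmap_mode="r")
        self.labels = np.load(labels_path, mmap_mode="r")

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # .copy() detaches the slice from the memory map before tensor conversion
        x = torch.from_numpy(self.input_ids[idx].copy()).long()
        y = torch.tensor(int(self.labels[idx]))
        return x, y
```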

Regarding things like augmentations, I think the best approach is to do batch augmentations on the TPU; I believe this is already standard practice for TPU usage with TensorFlow. Batch augmentation isn't something that's typically done in PyTorch, but fastai's implementation of it might be worth a look...
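As a rough illustration of what a batch augmentation on the device could look like (a stand-in for fastai's batch-transform approach, not their API):

```python
import torch


def batch_random_hflip(images: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Randomly flip images in a (N, C, H, W) batch already resident on the XLA/CUDA device."""
    flip_mask = torch.rand(images.shape[0], device=images.device) < p
    flipped = torch.flip(images, dims=[3])  # horizontal flip of the whole batch at once
    # choose per sample between the original and flipped views
    return torch.where(flip_mask[:, None, None, None], flipped, images)
```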

I am assuming you want to do ImageNet training, correct? There is an example over here and a tutorial over here.

tmabraham commented Feb 28 '21

> My biggest concern is with the current 2 VM TPU architecture: how to feed the TPU VMs with data without needing to spin up one n1-96 node per 8 TPU cores... Have you spent any time optimizing the infeed or tinkering with balancing pre-processing between the GCE feeding VM and the TPU host VM?

They have made the 1VM (single-VM TPU) setup publicly available. Not sure if TFRC could sponsor that, though.

byronyi commented Aug 26 '21

@byronyi this thread is a little outdated, but shortly after it was opened Ross got access to the 1VM setup and has been able to set up and train models on TPUs with PyTorch XLA.

Yes, the TRC program is the best way to get free access to TPU VMs to experiment with.

tmabraham commented Aug 26 '21

@byronyi as per @tmabraham's comment, I've been chugging along with some updated code that works with PyTorch XLA; it's on a different branch of this repository: https://github.com/rwightman/pytorch-image-models/tree/bits_and_tpu/timm/bits#readme

I've been juggling a few other sets of experiments and projects but have been slowly evolving that code. It's at the point where I'm regularly running training on TPU VM v3-8 nodes, and it's matching the master training code in output quality. Still lots to do on that bits code, though; it will end up more flexible than the current master training code. I'm in the TRC program.
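For anyone following along, fanning a training entry point out across the 8 cores of a v3-8 with PyTorch XLA typically looks roughly like the sketch below; `_mp_fn` and the flags are placeholders, not the actual `bits_and_tpu` code:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index, flags):
    # each spawned process owns one TPU core
    device = xm.xla_device()
    xm.master_print(f"process {index} running on {device}")
    # ... build the model and loaders on this core and run the training loop ...


if __name__ == "__main__":
    flags = {"batch_size": 128}  # placeholder config
    xmp.spawn(_mp_fn, args=(flags,), nprocs=8)  # nprocs=8 -> one process per v3-8 core
```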

rwightman commented Aug 26 '21