
Enable automatic mixed precision for XLA

ydcjeff opened this issue 3 years ago • 8 comments

Feature

Automatic mixed precision for XLA has landed in PyTorch 1.8.1 and the torch_xla nightly. We should enable it in the create_supervised_* helper functions.

Suggested solution

Remove the xla and amp checks in _check_arg().

  • For create_supervised_trainer, update supervised_training_step_tpu() to accept a scaler argument, just like supervised_training_step_amp() (see the sketch after this list).
  • For create_supervised_evaluator, removing the xla and amp checks in _check_arg() should be enough.
  • For the tests, we could remove the xla checks and run them only with PyTorch 1.8.1.
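
A minimal sketch of what an AMP-enabled TPU training step could look like, modeled on supervised_training_step_amp(). The torch_xla.amp autocast import and the exact scaler handling are assumptions about the nightly torch_xla API, and the signature is simplified relative to ignite's real helper:

```python
# A minimal sketch, assuming torch_xla's autocast API from the nightly build;
# not the final ignite implementation. Signature is simplified.
import torch
import torch_xla.core.xla_model as xm
from torch_xla.amp import autocast  # assumed to exist in torch_xla nightly


def supervised_training_step_tpu(model, optimizer, loss_fn, device, scaler=None):
    def update(engine, batch):
        model.train()
        optimizer.zero_grad()
        x, y = batch
        x, y = x.to(device), y.to(device)
        with autocast(device):  # bfloat16 on TPU, float16 on XLA:GPU
            y_pred = model(x)
            loss = loss_fn(y_pred, y)
        if scaler is not None:
            # GradScaler is only relevant on XLA:GPU; TPU bfloat16 needs no scaling
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            xm.mark_step()
        else:
            loss.backward()
            xm.optimizer_step(optimizer, barrier=True)
        return loss.item()

    return update
```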

Additional context

This feature should not be included in an ignite release until the next torch and torch_xla versions come out.

ydcjeff avatar Apr 12 '21 08:04 ydcjeff

Is it for GPUs or TPUs as well? I saw that and was thinking that it is for GPUs only. @ydcjeff, can you check that on Colab please?

vfdev-5 avatar Apr 12 '21 08:04 vfdev-5

I would like to work on this issue

01-vyom avatar Sep 09 '21 02:09 01-vyom

@01-vyom Hi! Thank you for your help!

IMO a good starting point would be to check @vfdev-5's remark above on Colab.

sdesrozis avatar Sep 09 '21 05:09 sdesrozis

I checked it out on Colab for TPUs, and it works on TPU. The code used is from the following issue: https://github.com/pytorch/pytorch/issues/61804

I tried to test on GPU, but I was not able to match the versions of pytorch-xla and PyTorch CUDA.

Also, multiple developers on xla and pytorch suggest that AMP will run on GPU only, as TPU doesn't support float16 [which it does, according to my tests]: https://github.com/pytorch/pytorch/pull/48570#discussion_r536282158

Moreover, the tests that they have included for autocast only run with XLA:GPU and XLA:CPU: https://github.com/pytorch/xla/blob/81da600883f0d6342b19749cc08be18b8decc051/test/test_train_mp_imagenet.py#L30-L33

https://github.com/pytorch/xla/pull/3089

So, I think it works for both GPU and TPU, but their codebase only shows/acknowledges CPU and GPU.
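
For reference, a minimal smoke test along these lines (a hypothetical reconstruction; the exact code used is in the linked issue) could look like:

```python
# A minimal sketch of an autocast smoke test on an XLA device (hypothetical
# reconstruction; the actual test code is in the linked pytorch issue).
import torch
import torch_xla.core.xla_model as xm
from torch_xla.amp import autocast  # assumed torch_xla nightly API

device = xm.xla_device()
a = torch.rand(4, 4, device=device)
b = torch.rand(4, 4, device=device)

with autocast(device):
    c = a @ b

# On TPU, autocast should produce bfloat16; on XLA:GPU, float16.
print(c.dtype)
```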

01-vyom avatar Sep 12 '21 18:09 01-vyom

@01-vyom thanks a lot for the tests and the feedback!

For TPUs, can you please check on Colab, with the updated ignite code (TPU + AMP) on the CIFAR10 dataset, whether it trains and whether it is faster than without AMP? (See the sketch below.)

As for GPUs, if we could clearly understand how to install xla with GPU support on an infrastructure with GPUs (using docker), we could test that from our side as well. In that case, we might have more luck matching versions. Here is an example of how to install torch_xla with CPU support locally: https://github.com/pytorch/ignite/blob/master/.github/workflows/tpu-tests.yml#L52-L61
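
A rough sketch of the requested timing comparison, assuming the proposed change lifting the xla+amp check in _check_arg() is in place; the toy model and data stand in for a real CIFAR10 setup:

```python
# A rough timing sketch: one epoch with and without AMP on an XLA device.
# Assumes ignite allows amp_mode="amp" on XLA after the proposed change.
# Note: the first XLA epoch includes compilation time, so in practice
# warm-up epochs should be run before measuring.
import time

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch.utils.data import DataLoader, TensorDataset

from ignite.engine import create_supervised_trainer

device = xm.xla_device()
model = nn.Linear(32, 10).to(device)  # toy stand-in for a CIFAR10 model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(
    TensorDataset(torch.rand(512, 32), torch.randint(0, 10, (512,))),
    batch_size=64,
)


def epoch_time(amp_mode):
    trainer = create_supervised_trainer(
        model, optimizer, loss_fn, device=device, amp_mode=amp_mode
    )
    start = time.time()
    trainer.run(loader, max_epochs=1)
    return time.time() - start


print("fp32:", epoch_time(None))
print("amp :", epoch_time("amp"))
```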

vfdev-5 avatar Sep 13 '21 22:09 vfdev-5

@vfdev-5 Can I work on this issue?

Zekrom-7780 avatar Nov 10 '23 14:11 Zekrom-7780

@Zekrom-7780 yes, but I would expect this can be a bit tricky. Try to think about it and let's discuss the plan here or on Discord.

vfdev-5 avatar Nov 10 '23 16:11 vfdev-5

Hey 👋, I've just created a thread for this issue on PyTorch-Ignite Discord where you can quickly talk to the community on the topic.

🤖 This comment was automatically posted by Discuss on Discord

github-actions[bot] avatar Nov 10 '23 16:11 github-actions[bot]