
How to run with multi-GPU?

Open · ssean819 opened this issue on Sep 27 '20 · 8 comments

Hi, I want to try running with multiple GPUs, but when I set the GPU number greater than 1, it outputs:

Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.

Training then stops at epoch 1.

It seems we need to use MirroredStrategy for multi-GPU now: https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
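For reference, the general pattern from the TensorFlow docs looks roughly like this (a minimal sketch with a stand-in Dense model, not MIScnn code):

# General MirroredStrategy pattern: build & compile the model inside the scope
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # stand-in model
    model.compile(optimizer="adam", loss="mse")
# model.fit(...) can then be called outside the scope as usual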

Will the next version be updated to use MirroredStrategy? In the meantime, I am looking for a way to modify the code to use MirroredStrategy myself.

Best regards.

ssean819 avatar Sep 27 '20 17:09 ssean819

Hey @ssean819,

you are absolutely right. Thank you for spotting this deprecated functionality.

I will replace the Keras multi_gpu model with the TensorFlow MirroredStrategy and release it in the next update once it's tested & ready.

Cheers, Dominik

Tasks

  • [x] Replaced Keras multi_gpu with Tensorflow MirroredStrategy
  • [x] Implemented unittesting for multi-GPU
  • [x] Tested new feature
  • [x] Changed the multi-GPU parameter of the Neural_Network class to multi_gpu (boolean)
  • [x] Updated wiki and old example code
  • [x] Merged dev branch into Master
  • [x] Release new PyPI version

Related Commits: 1eb0a95d345a15f409e5ea764709893deb6a627c, a36716c8cc287b6e387101fbe7aed7e08c831216, f70d2b5c8368a0f52181495cea100243ea6a1cf2

Notes

You can now use MirroredStrategy in MIScnn if you run something like this:

# Multi-GPU utilization via tf.distribute.MirroredStrategy
from miscnn import Neural_Network

nn = Neural_Network(preprocessor=pp, multi_gpu=True)  # pp: a configured Preprocessor
nn.train(sample_list, epochs=3)                       # sample_list: training samples
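Per the task list above, multi_gpu=True should simply make MIScnn build and compile its model inside a tf.distribute.MirroredStrategy scope, so no further distribution code is needed on the caller's side.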

muellerdo avatar Oct 01 '20 13:10 muellerdo

Hi, thank you a lot for adding the multi-GPU function. But when I try to install miscnn 1.1.0, it seems some files are missing. The problem is below.

Collecting miscnn
  Using cached miscnn-1.1.0.tar.gz (55 kB)
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\sean\anaconda3\envs\py3.8\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\sean\\AppData\\Local\\Temp\\pip-install-5fjrmc1o\\miscnn\\setup.py'"'"'; __file__='"'"'C:\\Users\\sean\\AppData\\Local\\Temp\\pip-install-5fjrmc1o\\miscnn\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\sean\AppData\Local\Temp\pip-pip-egg-info-37htbhms'
         cwd: C:\Users\sean\AppData\Local\Temp\pip-install-5fjrmc1o\miscnn\
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\sean\AppData\Local\Temp\pip-install-5fjrmc1o\miscnn\setup.py", line 5, in <module>
        with open("docs/README.PyPI.md", "r") as fh:
    FileNotFoundError: [Errno 2] No such file or directory: 'docs/README.PyPI.md'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I think the problem is that pip falls back to the .tar.gz source distribution because no .whl wheel is available.

ssean819 avatar Oct 06 '20 06:10 ssean819

Hi @muellerdo, I think some users may run into an NCCL problem when using multi-GPU. The error info looks like this:

No OpKernel was registered to support Op 'NcclAllReduce'

tf.distribute.MirroredStrategy() uses NCCL by default. We can switch to tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()), which solves the missing-NCCL problem.
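A minimal sketch of that workaround (stand-in model; plain Keras rather than MIScnn internals):

# Select HierarchicalCopyAllReduce instead of the default NCCL reduction
# (NCCL is unavailable on Windows and on some driver setups)
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # stand-in model
    model.compile(optimizer="adam", loss="mse")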

ssean819 avatar Oct 06 '20 07:10 ssean819

Hey @ssean819,

But when I try to install miscnn1.1.0. It seems missing some files.

You are right. The wheel was missing on PyPI for some reason :O I uploaded it again and it should work now.

I think maybe someone would have NCCL problem when using multi-GPU. error info is like below

Thanks for the feedback! Will be changed to HierarchicalCopyAllReduce in the next version.

Cheers, Dominik

Tasks

  • [x] Changed NCCL to HierarchicalCopyAllReduce for MirroredStrategy
  • [x] Tested locally and on TravisCI node
  • [x] Merged branch into Master
  • [x] Release new PyPI version

Related Commits: 68eb07dd80fd5bb2f98dc8a2d07134dbe8dc3be6

muellerdo avatar Oct 07 '20 16:10 muellerdo

Hi @muellerdo

Now when I test with multi-GPU, this problem occurs:

F .\tensorflow/core/kernels/conv_2d_gpu.h:1021] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: invalid configuration argument

It seems to be a TensorFlow problem, but I am not sure. Do you know how to fix this? I am trying to find a solution.

ssean819 avatar Oct 15 '20 05:10 ssean819

Hi @ssean819,

you are correct, this is a TensorFlow issue. Sadly, I'm unfamiliar with this error.

Nevertheless, these two issues suggest that it could have something to do with:

  • Upgrading/Downgrading Tensorflow/cuDNN/CUDA version -> https://stackoverflow.com/questions/63258022/non-ok-status-gpulaunchkernel-status-internal-no-kernel-image-is-availab
  • A batch size that does not divide evenly across the number of GPUs -> https://github.com/tensorflow/tensorflow/issues/36310 (see the divisibility check sketched below)
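A quick sketch of that divisibility check (batch size 12 here is just an example value):

# Check that the global batch size divides evenly across the GPU replicas
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
replicas = strategy.num_replicas_in_sync  # e.g. 3 on a 3-GPU machine
batch_size = 12                           # example: pick a multiple of `replicas`
assert batch_size % replicas == 0, "batch size should be divisible by the GPU count"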

I tried to reproduce the error with a batch size that does not divide evenly across 3 GPUs (batch size 10), but it works fine for me on the latest stable TensorFlow Docker image and 3x NVIDIA TITAN RTX. Are you working on a Windows system?

Cheers, Dominik

muellerdo avatar Oct 15 '20 11:10 muellerdo

When I tried the MIScnn sample example (LCTSC) with the multi-GPU option on (Neural_Network(multi_gpu=True)), I got the following message right before epoch 1 and the kernel restarted; after that it cannot run anymore. There is no modification in the sample code except the multi-GPU option. Is there any solution for using multi-GPU in MIScnn? I am using A100 GPUs with the latest versions of MIScnn, CUDA, and cuDNN. Thank you!!

Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).
I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1999955000 Hz

Epoch 1/100
INFO:tensorflow:batch_all_reduce: 82 all-reduces with algorithm = hierarchical_copy, num_packs = 1
INFO:tensorflow:batch_all_reduce: 82 all-reduces with algorithm = hierarchical_copy, num_packs = 1

Kernel Restarting - The kernel for LCTSC.ipynb appears to have died. It will restart automatically.

tslee69 avatar Jul 20 '21 03:07 tslee69

@tslee69, it seems like TensorFlow has introduced some more issues to its multi-GPU support for Keras since version 2.4.0 :/

Check out these (the auto-sharding workaround they suggest is sketched after the links):

  • https://stackoverflow.com/questions/65322700/tensorflow-keras-consider-either-turning-off-auto-sharding-or-switching-the-a
  • https://github.com/tensorflow/tensorflow/commit/92bd8e1034086d28b7e47d1a523caa452bacd06a
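A rough sketch of that workaround, assuming you can reach the underlying tf.data pipeline (the toy dataset is a placeholder; whether MIScnn exposes its internal pipeline for this is untested):

# Switch the auto-shard policy from FILE to DATA, as the warning suggests
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0])  # placeholder dataset
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dataset = dataset.with_options(options)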

muellerdo avatar Jul 20 '21 13:07 muellerdo