
Enable XLA support for Tensorflow

Open Randl opened this issue 7 years ago • 2 comments

XLA can significantly increase computation speed.

I tried to measure the speed-up, but unfortunately didn't get significant results:

$ python3 benchmark_vgg.py --batch_size 4000
WARNING:tensorflow:From benchmark_vgg.py:184: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
2017-02-20 17:14:06.476874: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-02-20 17:14:06.476908: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-02-20 17:14:08.569973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Graphics Device
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:04:00.0
Total memory: 11.91GiB
Free memory: 11.63GiB
2017-02-20 17:14:08.570458: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x3eccad0
2017-02-20 17:14:09.183220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Graphics Device
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:41:00.0
Total memory: 11.91GiB
Free memory: 11.63GiB
2017-02-20 17:14:09.183512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
2017-02-20 17:14:09.183570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
2017-02-20 17:14:09.183633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1
2017-02-20 17:14:09.183658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y N
2017-02-20 17:14:09.183675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   N Y
2017-02-20 17:14:09.183905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:04:00.0)
2017-02-20 17:14:09.184090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Graphics Device, pci bus id: 0000:41:00.0)
2017-02-20 17:14:09.749515: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-02-20 17:14:09.749669: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 96 visible devices
2017-02-20 17:14:09.794250: I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
2017-02-20 17:14:09.794375: I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-02-20 17:14:09.794871: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-02-20 17:14:09.794890: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 96 visible devices
2017-02-20 17:14:09.826939: I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform CUDA. Devices:
2017-02-20 17:14:09.827028: I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): Graphics Device, Compute Capability 6.0
2017-02-20 17:14:09.827054: I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (1): Graphics Device, Compute Capability 6.0
2017-02-20 17:14:14.149286: step 0, duration = 0.000
2017-02-20 17:14:14.152526: step 10, duration = 0.000
2017-02-20 17:14:14.155913: step 20, duration = 0.000
2017-02-20 17:14:14.158968: step 30, duration = 0.000
2017-02-20 17:14:14.161953: step 40, duration = 0.000
2017-02-20 17:14:14.165289: step 50, duration = 0.001
2017-02-20 17:14:14.168046: step 60, duration = 0.000
2017-02-20 17:14:14.172249: step 70, duration = 0.000
2017-02-20 17:14:14.174981: step 80, duration = 0.000
2017-02-20 17:14:14.177259: step 90, duration = 0.000
2017-02-20 17:14:14.179223: Forward across 100 steps, 0.000 +/- 0.000 sec / batch
2017-02-20 17:14:15.127072: step 0, duration = 0.006
2017-02-20 17:14:15.193918: step 10, duration = 0.006
2017-02-20 17:14:15.258036: step 20, duration = 0.006
2017-02-20 17:14:15.311999: step 30, duration = 0.006
2017-02-20 17:14:15.364200: step 40, duration = 0.005
2017-02-20 17:14:15.416405: step 50, duration = 0.005
2017-02-20 17:14:15.470125: step 60, duration = 0.006
2017-02-20 17:14:15.508636: step 70, duration = 0.003
2017-02-20 17:14:15.542784: step 80, duration = 0.003
2017-02-20 17:14:15.576780: step 90, duration = 0.003
2017-02-20 17:14:15.607214: Forward-backward across 100 steps, 0.005 +/- 0.001 sec / batch

(I used a P100 for these measurements.)
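For context, per-step numbers like the ones above typically come from a simple wall-clock loop around each batch. The sketch below is illustrative only (the names are not from benchmark_vgg.py) and shows the general shape of such a harness:

```python
import time

def time_steps(run_step, num_steps=100, log_every=10):
    """Time a callable per step and report mean +/- stddev, like the log above."""
    durations = []
    for step in range(num_steps):
        start = time.time()
        run_step()
        durations.append(time.time() - start)
        if step % log_every == 0:
            print("step %d, duration = %.3f" % (step, durations[-1]))
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    print("Across %d steps, %.3f +/- %.3f sec / batch"
          % (num_steps, mean, var ** 0.5))
    return mean

time_steps(lambda: None, num_steps=5)
```

One caveat with this style of timing: it only measures what the timed call itself waits for, so forward durations reported as 0.000 are usually a sign that the timed op isn't forcing the full computation rather than evidence of a real speed-up.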

Randl avatar Feb 20 '17 15:02 Randl

Could you post your benchmark code?

aodhan-domhnaill avatar Mar 06 '17 22:03 aodhan-domhnaill

  config = tf.ConfigProto()
  
  # Turns on XLA JIT compilation.
  config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
  run_metadata = tf.RunMetadata()
  sess = tf.Session(config=config)
  tf.global_variables_initializer().run(session=sess)

I've added these lines to enable XLA.
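As a side note for anyone reading this later: newer TensorFlow releases (2.x) expose the same XLA JIT per function rather than per session. The snippet below is a minimal sketch of that route, assuming TensorFlow 2.4+ where `tf.function` accepts a `jit_compile` argument; the function and tensor names are illustrative:

```python
import tensorflow as tf

# Compile this function with XLA. With jit_compile=True the runtime raises
# an error if the function cannot be compiled, instead of silently falling
# back to the regular executor.
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.ones((8, 4))
w = tf.ones((4, 2))
b = tf.zeros((2,))
y = dense_relu(x, w, b)
print(y.shape)  # (8, 2)
```

The session-level `global_jit_level` flag above remains the way to turn XLA on for a whole TF1 graph; `jit_compile` instead scopes compilation to a single function, which makes it easier to benchmark XLA and non-XLA versions of the same op side by side.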

Randl avatar Mar 07 '17 06:03 Randl