
Benchmark performance drops significantly when using map_and_batch

Open eladweiss opened this issue 7 years ago • 8 comments

After pulling the latest benchmarks code, we noticed a drop in performance on the inception3 and resnet152 models. Testing with TensorFlow r1.5 on 32x P100 GPUs (8 servers), ImageNet data, batch size 64.

Inception3:

  • grpc: 3350 ==> 3000
  • grpc + verbs: 3800 ==> 3150

Resnet152:

  • grpc: 2050 ==> 2000
  • grpc + verbs: 2450 ==> 2250

We isolated the 'problematic' change to: https://github.com/tensorflow/benchmarks/commit/82dd0539c76afa8491e50d8f796e686b4d97b988#diff-3269d1838b2ebc9c6c071802fb946ca1R521

After replacing the specific call to map_and_batch() with the previous call to map() with 16 parallel calls (https://github.com/Mellanox/benchmarks/commit/56e0b2298f835905f7d8a53c5bf482ed1dce55fd), we get the high numbers again. We don't have a theory to explain this.
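
For context, here is a minimal sketch of the two pipeline variants being compared. It is not the benchmark's exact code: parse_fn, the file list, the image size, and the parallelism constants are placeholders, and it assumes the tf.contrib.data.map_and_batch API available around TF r1.5 (the benchmark itself imports it from contrib's batching module).

    import tensorflow as tf

    def parse_fn(serialized):
        # Placeholder parser; the real benchmark decodes and preprocesses ImageNet records.
        features = tf.parse_single_example(
            serialized, {"image/encoded": tf.FixedLenFeature([], tf.string)})
        image = tf.image.decode_jpeg(features["image/encoded"], channels=3)
        return tf.image.resize_images(image, [299, 299])  # fixed shape so batching works

    dataset = tf.data.TFRecordDataset(["/path/to/train-00000-of-01024"])

    use_map_and_batch = True  # behavior after the commit in question
    if use_map_and_batch:
        # Fused map+batch introduced by the change.
        dataset = dataset.apply(tf.contrib.data.map_and_batch(
            map_func=parse_fn, batch_size=64, num_parallel_batches=4))
    else:
        # Previous behavior: bounded parallel map, then batch.
        dataset = dataset.map(parse_fn, num_parallel_calls=16)
        dataset = dataset.batch(64)
    dataset = dataset.prefetch(1)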

Thanks

eladweiss avatar Feb 08 '18 09:02 eladweiss

I will have someone take a look; it might also impact other tests. I finally got nightly tests up (in the last week, maybe), but I do not have anything distributed, only multi-GPU on DGX-1s.

Thank you for linking to the change in question.

tfboyd avatar Feb 08 '18 16:02 tfboyd

@yanivbl @shamoya @shimonran @tfboyd Thanks! I guess we can assist in distributed testing if a patch is available. (I'll need to schedule this with my supervisors, as our lab is currently very busy).

eladweiss avatar Feb 11 '18 07:02 eladweiss

Hi @tfboyd ,

I have faced a similar issue. Moving from benchmarks git commit f5d85ae to 82dd053, there is a significant performance drop due to the following change. I am using 4 GPUs and a batch size of 64 per GPU.

https://github.com/tensorflow/benchmarks/compare/f5d85ae...82dd053#diff-3269d1838b2ebc9c6c071802fb946ca1R522

Looking at an NV profile, the effect shows up as the data transfer from CPU to GPU taking longer for the same amount of data.
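
One way to narrow this down is to time the tf.data pipeline in isolation, with no model and no host-to-device copy, on both commits; if the standalone throughput is similar, the regression is more likely in how the data reaches the GPU. A rough sketch, assuming the dataset object is built the same way as in the benchmark:

    import time
    import tensorflow as tf

    def measure_input_pipeline(dataset, batch_size=64, num_batches=200):
        """Times the CPU-side pipeline alone (no GPU transfer, no model)."""
        iterator = dataset.make_one_shot_iterator()
        next_batch = iterator.get_next()
        with tf.Session() as sess:
            sess.run(next_batch)  # warm-up
            start = time.time()
            for _ in range(num_batches):
                sess.run(next_batch)
            elapsed = time.time() - start
        print("input pipeline alone: %.1f images/sec"
              % (num_batches * batch_size / elapsed))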

This performance issue is present in the master branch too. To get the performance back, do I need to go back to commit f5d85ae, or is there a plan to fix this in the master branch?

Thanks.

asispatra avatar Feb 20 '18 11:02 asispatra

Sorry, I got distracted. @reedwm, can you take a look at the diff? We still do not have an OSS distributed test to verify this externally, but if the change does not impact multi-GPU (single node) performance, then maybe we can do a rollback. There was an offer to test a patch for us if we can provide a PR or branch.

tfboyd avatar Feb 22 '18 21:02 tfboyd

Same problem here: if I use batching.map_and_batch, it's much slower than batching first and then mapping. Example code:

    if use_map_and_batch:
        # x: serialized_example, y: index in current batch
        dataset = dataset.apply(
            batching.map_and_batch(
                map_func=lambda x, y: parse_fn(x, batch_pos=y),
                batch_size=batch_size_per_split,
                num_parallel_batches=self.num_splits))
    else:
        dataset = dataset.batch(self.batch_size)
        dataset = dataset.map(lambda x: parse_fn(x),
                              num_parallel_calls=self.num_data_mapper)

I found one reason: parse_example(...) is much faster than parse_single_example(...).
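
To illustrate that point: with batch-then-map, the map function receives a whole batch of serialized strings and can call parse_example once per batch, while map_and_batch's map function sees one record at a time and has to use parse_single_example. A hedged sketch (the feature spec and file path are made up):

    import tensorflow as tf

    feature_spec = {"image/encoded": tf.FixedLenFeature([], tf.string),
                    "image/class/label": tf.FixedLenFeature([], tf.int64)}

    dataset = tf.data.TFRecordDataset(["/path/to/records"])

    # Per-record parsing, as used inside map()/map_and_batch():
    per_record = dataset.map(
        lambda s: tf.parse_single_example(s, feature_spec),
        num_parallel_calls=16).batch(64)

    # Batched parsing, possible when the serialized strings are batched first:
    per_batch = dataset.batch(64).map(
        lambda batch: tf.parse_example(batch, feature_spec),
        num_parallel_calls=16)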

anpark avatar Feb 24 '18 06:02 anpark

I have a certain theory, not sure if it's correct.

Looking at the worker CPU utilization graphs, it is possible that the increased parallelism of MapAndBatch(), while making the pre-processing finish faster, actually steals resources from the worker's CPU processing thread (because the pre-processing is now utilizing all of the cores).

If I am correct, then the peak of CPU utilization at the start of the graph is the preprocessing, and the trailing tail is the worker's CPU processing.

From what I see, the CPU processing is in fact what triggers the end of the step, and the preprocessing is far from being a bottleneck. I also note that the processing thread does not require a lot of CPU most of the time, but perhaps it requires a little more at the start of a step.

[screenshot: worker CPU utilization graph]
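
If that theory holds, one quick experiment is to cap how many CPU threads TensorFlow can use (and/or dial back the pipeline's own parallelism) and watch whether the trailing CPU phase shortens. A sketch with arbitrary thread counts, not a recommended setting:

    import tensorflow as tf

    # Bound the op-level thread pools so preprocessing cannot grab every core.
    config = tf.ConfigProto(
        intra_op_parallelism_threads=8,   # arbitrary; tune per machine
        inter_op_parallelism_threads=8)   # arbitrary; tune per machine
    sess = tf.Session(config=config)

    # Another knob: lower the pipeline's parallelism itself, e.g. pass a smaller
    # num_parallel_batches to map_and_batch, or num_parallel_calls to map().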

eladweiss avatar Feb 26 '18 15:02 eladweiss

@reedwm , Is there any progress to resolve this issue?

asispatra avatar Mar 08 '18 06:03 asispatra

Not yet, but I hope to look at it soon.

@eladweiss, thank you for your analysis! In benchmark_cnn.py, we set the env var TF_GPU_THREAD_MODE to gpu_private, which gives each GPU two dedicated threads. We do this because we observed exactly what you described: preprocessing threads steal resources from the non-preprocessor threads, delaying the scheduling of GPU kernels, and hence delaying the GPU from doing work.

From your analysis, it seems the issue is probably still occurring. Perhaps now, preprocessing threads are stealing resources from CPU ops instead of GPU ops. I will try to look into this soon.
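
For reference, the gpu_private thread mode mentioned above is controlled by environment variables that must be set before TensorFlow initializes its devices; a minimal sketch (the explicit thread count is an assumption, shown only to make the "two dedicated threads per GPU" behavior visible):

    import os

    # Must be set before the first Session creates the GPU devices.
    os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
    os.environ["TF_GPU_THREAD_COUNT"] = "2"  # threads per GPU, per the comment above

    import tensorflow as tf
    sess = tf.Session()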

reedwm avatar Mar 08 '18 19:03 reedwm