Benchmark performance drops significantly when using map_and_batch
After running the latest benchmarks, we noticed a drop in performance on the inception3 and resnet152 models. Tested with TensorFlow r1.5 on 32x P100 GPUs (8 servers), ImageNet data, batch size 64.
Inception3 (images/sec):
- grpc: 3350 ==> 3000
- grpc + verbs: 3800 ==> 3150
Resnet152 (images/sec):
- grpc: 2050 ==> 2000
- grpc + verbs: 2450 ==> 2250
We isolated the 'problematic' change to: https://github.com/tensorflow/benchmarks/commit/82dd0539c76afa8491e50d8f796e686b4d97b988#diff-3269d1838b2ebc9c6c071802fb946ca1R521
After replacing the specific call to map_and_batch() with the previous call to map() with 16 parallel calls (https://github.com/Mellanox/benchmarks/commit/56e0b2298f835905f7d8a53c5bf482ed1dce55fd), we get the high numbers again. We don't have a theory to explain this.
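For readers unfamiliar with the change, here is a minimal sketch of the two input-pipeline variants being compared; parse_fn, the file list, and the batch size are illustrative placeholders rather than the benchmark's actual code:

```python
import tensorflow as tf
from tensorflow.contrib.data.python.ops import batching

# Placeholders, not the real tf_cnn_benchmarks names/values.
filenames = ["train-00000-of-01024"]
batch_size = 64

def parse_fn(serialized_example):
    # The real function decodes and preprocesses an ImageNet record.
    return serialized_example

# Pipeline after the change: fused map + batch via map_and_batch().
ds_new = tf.data.TFRecordDataset(filenames)
ds_new = ds_new.apply(
    batching.map_and_batch(map_func=parse_fn, batch_size=batch_size))

# Pipeline we reverted to: map() with 16 parallel calls, then batch().
ds_old = tf.data.TFRecordDataset(filenames)
ds_old = ds_old.map(parse_fn, num_parallel_calls=16)
ds_old = ds_old.batch(batch_size)
```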
Thanks
I will have someone take a look; it might also impact other tests. I finally got nightly tests up (within the last week, maybe), but I do not have anything distributed, only multi-GPU on DGX-1s.
Thank you for linking to the change in question.
@yanivbl @shamoya @shimonran @tfboyd Thanks! I guess we can assist in distributed testing if a patch is available. (I'll need to schedule this with my supervisors, as our lab is currently very busy).
Hi @tfboyd,
I have faced a similar issue. Moving from benchmarks commit f5d85ae to 82dd053, there is a significant performance drop caused by the following change. I am using 4 GPUs with a batch size of 64 per GPU.
https://github.com/tensorflow/benchmarks/compare/f5d85ae...82dd053#diff-3269d1838b2ebc9c6c071802fb946ca1R522
Looking at an NVIDIA profile, the effect shows up as data transfer from CPU to GPU taking longer for the same amount of data.
This performance issue is present in the master branch too. To get the performance back, do I need to go back to commit f5d85ae, or is there a plan to fix this in the master branch?
Thanks.
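In case anyone wants to reproduce the measurement without an external profiler, here is a sketch of collecting a Chrome trace in TF 1.x, where host-to-device copies show up as MEMCPYHtoD events; the toy graph below stands in for the real benchmark step:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Toy graph standing in for the real training step: a CPU tensor
# consumed on the GPU forces a host-to-device copy.
with tf.device("/cpu:0"):
    x = tf.random_normal([64, 224, 224, 3])
with tf.device("/gpu:0"):
    y = tf.reduce_sum(x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(y, options=run_options, run_metadata=run_metadata)

# Host-to-device copies appear as MEMCPYHtoD events in the trace.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())
```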
Sorry, I got distracted. @reedwm, can you take a look at the diff? We still do not have an OSS distributed test to verify this externally, but if the change does not impact multi-GPU (single node), then maybe we can do a rollback. There was an offer to test a patch for us if we can provide a PR or branch.
Same problem here. If I use batching.map_and_batch, it's much slower than batching first and then mapping. Example code:

```python
if use_map_and_batch:
    # x: serialized_example, y: index in current batch
    dataset = dataset.apply(
        batching.map_and_batch(
            map_func=lambda x, y: parse_fn(x, batch_pos=y),
            batch_size=batch_size_per_split,
            num_parallel_batches=self.num_splits))
else:
    dataset = dataset.batch(self.batch_size)
    dataset = dataset.map(lambda x: parse_fn(x),
                          num_parallel_calls=self.num_data_mapper)
```
I found one reason: parse_example(...) is much faster than parse_single_example(...).
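To illustrate the distinction (the feature spec below is a hypothetical stand-in for the real ImageNet record format): batching first allows one vectorized tf.parse_example call per batch, whereas mapping each record before batching, as map_and_batch()'s map_func does, calls tf.parse_single_example once per record:

```python
import tensorflow as tf

# Hypothetical feature spec, standing in for the real record format.
feature_map = {
    "image/encoded": tf.FixedLenFeature([], tf.string),
    "image/class/label": tf.FixedLenFeature([], tf.int64),
}

def parse_batched(serialized_batch):
    # One vectorized call for a whole batch of serialized records
    # (possible when dataset.batch() runs before dataset.map()).
    return tf.parse_example(serialized_batch, feature_map)

def parse_one(serialized_record):
    # Called once per record, e.g. from map_and_batch()'s map_func
    # or from dataset.map() applied before batching.
    return tf.parse_single_example(serialized_record, feature_map)
```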
I have a theory, though I'm not sure it's correct.
Looking at the worker CPU utilization graphs, it is possible that the increased parallelism of MapAndBatch(), while making the preprocessing finish faster, actually steals resources from the worker's CPU processing thread (because the preprocessing now utilizes all of the cores).
If I am correct, then the peak of CPU utilization at the start of the graph is the preprocessing, and the trailing tail is the worker's CPU processing.
From what I see, the CPU processing is in fact what determines when the step finishes, and the preprocessing is far from being the bottleneck. I also note that the processing thread does not require much CPU most of the time, but it may require a little more at the start of a step.
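One hedged way to test this theory (the names and thread counts below are illustrative, not the benchmark's actual configuration) is to cap the preprocessing parallelism so a few cores stay free for the worker's own CPU processing, and check whether the step time recovers:

```python
import multiprocessing
import tensorflow as tf

# Illustrative values, not the benchmark's actual configuration.
num_cores = multiprocessing.cpu_count()
preprocess_threads = max(1, num_cores - 4)  # reserve ~4 cores for the worker

def parse_fn(serialized_example):
    # Stand-in for the real ImageNet decode/augment function.
    return serialized_example

dataset = tf.data.TFRecordDataset(["train-00000-of-01024"])  # placeholder file
dataset = dataset.map(parse_fn, num_parallel_calls=preprocess_threads)
dataset = dataset.batch(64)
```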
@reedwm, is there any progress on resolving this issue?
Not yet, but I hope to look at it soon.
@eladweiss, thank you for your analysis! In benchmark_cnn.py, we set the env var TF_GPU_THREAD_MODE to gpu_private, which gives each GPU two dedicated threads. We do this because we observed exactly what you described: preprocessing threads steal resources from the non-preprocessor threads, delaying the scheduling of GPU kernels, and hence delaying the GPU from doing work.
From your analysis, it seems the issue is probably still occurring. Perhaps now, preprocessing threads are stealing resources from CPU ops instead of GPU ops. I will try to look into this soon.
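For reference, a minimal sketch of the environment-variable setup described above; TF_GPU_THREAD_COUNT is an assumed companion setting rather than a quote from benchmark_cnn.py, and both need to be set before TensorFlow creates its GPU devices:

```python
import os

# Give each GPU dedicated scheduling threads so preprocessing cannot
# starve them. Set before the first tf.Session is created.
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
os.environ["TF_GPU_THREAD_COUNT"] = "2"  # assumed: two threads per GPU

import tensorflow as tf

with tf.Session() as sess:
    pass  # build and run the model as usual
```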