CTranslate2
faster_whisper batch operation time-consuming
Is this related to CTranslate2?
The following is copied from this . I ran a test of batching in faster-whisper.
However, faster_whisper's batch encode takes time that scales with the number of samples, so it seems that batched encoding in CTranslate2 does not work as expected?
By the way, decode/generate has the same issue for batched operations.
Can you move the input features to the model device before measuring the execution time?
I ran some tests; the inference speed results are suspicious for encode but not for generate. One difference is that encode uses the post method: https://github.com/OpenNMT/CTranslate2/blob/4f8a4f334c59588223b6f1f24b707d7e8d5fe08c/src/models/whisper.cc#L656
The other (generate) uses post_batch: https://github.com/OpenNMT/CTranslate2/blob/4f8a4f334c59588223b6f1f24b707d7e8d5fe08c/src/models/whisper.cc#L667
Here is Guillaume's PR where the encode method was implemented: https://github.com/OpenNMT/CTranslate2/pull/1144. I don't know why Guillaume did not use post_batch, though. Maybe the encode method only relies on CUDA code that should be efficient on a batched tensor, compared to generate, which runs higher-level logic?
I can blindly try to use post_batch.
@vince62s any thoughts on that? Or can you tag someone who could help on CT2?
Can you move the input features to the model device before measuring the execution time?
10 samples.
Before changes
def get_ctranslate2_storage(segment: np.ndarray) -> ctranslate2.StorageView:
    segment = np.ascontiguousarray(segment)
    segment = ctranslate2.StorageView.from_array(segment)
    return segment
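For context, here is a minimal sketch of how this helper is presumably driven for the batched encode being timed below; the model path, the random features, and the encode call are assumptions based on the logged shapes:

import numpy as np
import ctranslate2

# Hypothetical setup: the path and the [10, 128, 3000] feature shape
# mirror the logs below (large-v3 uses 128 mel bins).
model = ctranslate2.models.Whisper("whisper-large-v3-ct2", device="cuda")
features = np.random.rand(10, 128, 3000).astype(np.float32)

storage = get_ctranslate2_storage(features)
encoder_output = model.encode(storage, to_cpu=False)
print(encoder_output.shape)  # expected: [10, 1500, 1280]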
prepare features cost: 0.527921199798584
get_ctranslate2_storage cost 0.013983964920043945 seconds
self.model.encode [10, 128, 3000] cost 5.466844797134399 seconds
encoder_output shape:[10, 1500, 1280]
encode cost: 5.481068134307861
After changes
def get_ctranslate2_storage(segment: np.ndarray) -> ctranslate2.StorageView:
    segment = np.ascontiguousarray(segment)
    segment = torch.from_numpy(segment).float().to("cuda:0")  # added line
    segment = ctranslate2.StorageView.from_array(segment)
    return segment
prepare features cost: 0.47109270095825195
get_ctranslate2_storage cost 0.6899769306182861 seconds
self.model.encode [10, 128, 3000] cost 5.4121253490448 seconds
encoder_output shape:[10, 1500, 1280]
encode cost: 6.102340459823608
Conclusion:
- Moving the features from CPU to cuda:0 (where the model is) in get_ctranslate2_storage changes the copy cost from 0.0139 sec to 0.6899 sec, while encode only goes from 5.4668 sec to 5.4121 sec.
- So the problem is still in the CT2 encode itself, not in copying features between CPU and GPU.
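A side note on methodology: CUDA work can finish asynchronously with respect to the Python caller, so wall-clock timings around GPU calls are more reliable with explicit synchronization. A minimal sketch, assuming torch is available for the sync (whether CT2's encode returns before its kernels complete is itself an assumption worth checking):

import time
import torch

def timed_encode(model, storage):
    # Hypothetical helper: synchronize before and after so the measured
    # interval covers exactly the encode work, not previously queued kernels.
    torch.cuda.synchronize()
    start = time.perf_counter()
    output = model.encode(storage, to_cpu=False)
    torch.cuda.synchronize()
    return output, time.perf_counter() - start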
Here is a more detailed profiling with whisper tiny:
45.09% 77.56% 45.09% MultiHeadAttention 13.81ms
14.73% 14.73% 59.82% BiasAdd 4.51ms
12.79% 27.28% 72.61% dot_product_attention 3.92ms
7.45% 7.45% 80.06% SoftMax 2.28ms
7.04% 7.04% 87.10% MatMul 2.16ms
6.88% 21.61% 93.98% Gemm 2.11ms
2.84% 2.84% 96.83% Conv1D 0.87ms
0.79% 0.79% 97.62% Transpose 0.24ms
0.64% 0.64% 98.26% LayerNorm 0.20ms
0.50% 0.50% 98.76% Split 0.15ms
0.37% 0.37% 99.13% Add 0.11ms
0.27% 96.59% 99.40% TransformerEncoderLayer 0.08ms
0.20% 99.90% 99.60% WhisperEncoder 0.06ms
0.17% 21.78% 99.76% Dense 0.05ms
0.13% 0.13% 99.90% GELU 0.04ms
0.164976 0.107653 0.0736297 ... -0.023933 -0.588876 -0.235716
Batch size: 1, Execution time: 0.035004863739013674 seconds, Execution time per batch: 0.035004863739013674 seconds
35.34% 50.61% 35.34% dot_product_attention 17.50ms
27.67% 91.61% 63.01% MultiHeadAttention 13.70ms
14.65% 15.88% 77.66% Gemm 7.25ms
7.84% 7.84% 85.51% SoftMax 3.88ms
7.42% 7.42% 92.93% MatMul 3.67ms
2.03% 2.03% 94.95% Conv1D 1.00ms
1.25% 97.56% 96.21% TransformerEncoderLayer 0.62ms
1.22% 1.22% 97.43% BiasAdd 0.61ms
0.75% 0.75% 98.18% Transpose 0.37ms
0.59% 0.59% 98.77% LayerNorm 0.29ms
0.44% 0.44% 99.21% Split 0.22ms
0.38% 0.38% 99.59% Add 0.19ms
0.14% 99.95% 99.73% WhisperEncoder 0.07ms
0.12% 15.99% 99.84% Dense 0.06ms
0.11% 0.11% 99.95% GELU 0.06ms
0.164976 0.107653 0.0736297 ... -0.0239328 -0.588876 -0.235715
Batch size: 2, Execution time: 0.058016990661621096 seconds, Execution time per batch: 0.029008495330810548 seconds
43.45% 92.19% 43.45% MultiHeadAttention 40.96ms
18.09% 32.20% 61.55% dot_product_attention 17.05ms
9.94% 9.94% 71.49% Transpose 9.37ms
8.76% 9.83% 80.25% Gemm 8.26ms
7.78% 7.78% 88.03% SoftMax 7.34ms
6.33% 6.33% 94.36% MatMul 5.96ms
2.64% 2.64% 97.00% Conv1D 2.49ms
1.07% 1.07% 98.07% BiasAdd 1.01ms
0.56% 97.08% 98.63% TransformerEncoderLayer 0.53ms
0.44% 0.44% 99.07% LayerNorm 0.41ms
0.35% 0.35% 99.42% Add 0.33ms
0.35% 0.35% 99.77% Split 0.33ms
0.10% 0.10% 99.86% GELU 0.09ms
0.08% 99.99% 99.94% WhisperEncoder 0.07ms
0.05% 9.88% 99.99% Dense 0.04ms
0.164977 0.107653 0.07363 ... -0.0239328 -0.588876 -0.235715
Batch size: 4, Execution time: 0.09641846466064453 seconds, Execution time per batch: 0.024104616165161134 seconds
31.35% 86.33% 31.35% MultiHeadAttention 47.57ms
26.44% 44.62% 57.79% dot_product_attention 40.13ms
10.99% 10.99% 68.78% SoftMax 16.68ms
8.05% 9.28% 76.83% Gemm 12.22ms
7.19% 7.19% 84.02% MatMul 10.91ms
4.52% 97.61% 88.54% TransformerEncoderLayer 6.86ms
2.79% 2.79% 91.33% Transpose 4.24ms
2.39% 2.39% 93.72% Split 3.63ms
2.35% 2.35% 96.08% LayerNorm 3.57ms
2.08% 2.08% 98.16% Conv1D 3.16ms
1.23% 1.23% 99.39% BiasAdd 1.87ms
0.38% 0.38% 99.77% Add 0.57ms
0.11% 0.11% 99.88% GELU 0.17ms
0.07% 99.98% 99.96% WhisperEncoder 0.11ms
0.03% 9.31% 99.98% Dense 0.04ms
0.164976 0.107653 0.0736298 ... -0.0239328 -0.588876 -0.235715
Batch size: 8, Execution time: 0.1561301727294922 seconds, Execution time per batch: 0.019516271591186525 seconds
32.22% 88.50% 32.22% MultiHeadAttention 104.86ms
27.55% 48.49% 59.77% dot_product_attention 89.65ms
11.62% 11.62% 71.39% MatMul 37.82ms
9.33% 9.33% 80.72% SoftMax 30.36ms
6.91% 8.00% 87.62% Gemm 22.49ms
4.16% 97.41% 91.79% TransformerEncoderLayer 13.55ms
1.69% 1.69% 93.48% Transpose 5.51ms
1.69% 1.69% 95.17% Conv1D 5.50ms
1.35% 1.35% 96.52% Split 4.38ms
1.23% 1.23% 97.75% LayerNorm 4.02ms
1.09% 1.09% 98.84% BiasAdd 3.56ms
0.72% 99.99% 99.56% WhisperEncoder 2.33ms
0.32% 0.32% 99.89% Add 1.05ms
0.09% 0.09% 99.98% GELU 0.30ms
0.01% 8.02% 99.99% Dense 0.04ms
0.164977 0.107653 0.07363 ... -0.023933 -0.588876 -0.235715
Batch size: 16, Execution time: 0.3303467102050781 seconds, Execution time per batch: 0.020646669387817382 seconds
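For anyone wanting to reproduce the encode-only numbers above, here is a minimal sketch of the kind of loop that could produce them; the model path and the random features are assumptions (whisper tiny uses 80 mel bins, and repeating one sample also matches the identical output values printed for every batch size):

import time
import numpy as np
import ctranslate2

model = ctranslate2.models.Whisper("whisper-tiny-ct2", device="cuda")  # hypothetical path
single = np.random.rand(1, 80, 3000).astype(np.float32)

for batch_size in (1, 2, 4, 8, 16):
    batch = np.ascontiguousarray(np.repeat(single, batch_size, axis=0))
    storage = ctranslate2.StorageView.from_array(batch)
    start = time.perf_counter()
    model.encode(storage, to_cpu=True)  # to_cpu copies the result back, which also waits for completion
    elapsed = time.perf_counter() - start
    print(f"Batch size: {batch_size}, Execution time: {elapsed} seconds, "
          f"Execution time per batch: {elapsed / batch_size} seconds")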
Here is a profiling of the CT2 generate function, which gathers encoding and decoding. It seems that the decoding in beam_search is quite efficient with batching and represents the main time-consuming task. On the other hand, the encoding in WhisperEncoder does not really benefit from batching. It is possible that there is nothing wrong inside CT2 and that there is a bottleneck in the architecture/computations that prevents batching from being batch_size times faster. There is still a small improvement from encoder batching, just not proportional to the batch size.
Also, there is some optimization done within the replica pool that performs a silent batching, if I understand correctly; this silent batching could explain why CT2 whisper is fast and why manual batching does not provide that much improvement.
I will stop there as I could not find anything obviously wrong; feel free to continue investigating.
28.70% 34.79% 28.70% Gemm 104.68ms 126.92ms
11.91% 11.91% 40.61% MatMul 43.45ms 43.45ms
6.38% 57.72% 46.98% MultiHeadAttention 23.26ms 210.55ms
6.10% 6.10% 53.08% BiasAdd 22.24ms 22.24ms
5.96% 91.42% 59.04% beam_search 21.75ms 333.51ms
5.93% 5.93% 64.97% Concat 21.62ms 21.62ms
4.57% 4.57% 69.54% LayerNorm 16.67ms 16.67ms
4.49% 4.49% 74.03% TopK 16.39ms 16.39ms
4.37% 4.37% 78.40% Gather 15.93ms 15.93ms
3.69% 3.69% 82.09% SoftMax 13.45ms 13.45ms
3.06% 3.06% 85.15% Add 11.18ms 11.18ms
3.04% 18.64% 88.19% dot_product_attention 11.11ms 68.01ms
2.72% 2.72% 90.91% Transpose 9.92ms 9.92ms
2.45% 2.45% 93.37% Split 8.94ms 8.94ms
1.85% 67.98% 95.22% TransformerDecoderLayer 6.76ms 248.01ms
1.63% 76.54% 96.85% TransformerDecoder 5.94ms 279.23ms
1.46% 36.25% 98.31% Dense 5.33ms 132.25ms
1.17% 1.17% 99.48% LogSoftMax 4.28ms 4.28ms
0.23% 0.23% 99.71% Conv1D 0.83ms 0.83ms
0.07% 4.56% 99.78% BestSampler 0.24ms 16.63ms
0.06% 0.48% 99.84% Embeddings 0.23ms 1.76ms
0.05% 7.40% 99.89% TransformerEncoderLayer 0.19ms 27.00ms
0.03% 0.03% 99.92% Tile 0.12ms 0.12ms
0.02% 7.67% 99.94% WhisperEncoder 0.06ms 27.97ms
0.01% 0.01% 99.95% GELU 0.04ms 0.04ms
Batch size: 1, Execution time: 0.4774452209472656 seconds, Execution time per batch: 0.4774452209472656 seconds
25.34% 31.33% 25.34% Gemm 94.46ms 116.80ms
12.93% 12.93% 38.27% MatMul 48.19ms 48.19ms
8.95% 61.25% 47.22% MultiHeadAttention 33.36ms 228.32ms
6.17% 88.31% 53.39% beam_search 23.02ms 329.21ms
5.99% 5.99% 59.38% BiasAdd 22.34ms 22.34ms
4.57% 4.57% 63.96% Concat 17.05ms 17.05ms
4.51% 4.51% 68.47% TopK 16.81ms 16.81ms
4.47% 4.47% 72.94% LayerNorm 16.68ms 16.68ms
4.06% 4.06% 77.00% SoftMax 15.14ms 15.14ms
4.00% 4.00% 81.01% Gather 14.92ms 14.92ms
3.92% 20.91% 84.93% dot_product_attention 14.63ms 77.97ms
3.11% 3.11% 88.04% Add 11.59ms 11.59ms
2.74% 2.74% 90.77% Transpose 10.20ms 10.20ms
2.42% 2.42% 93.19% Split 9.02ms 9.02ms
1.83% 65.48% 95.02% TransformerDecoderLayer 6.82ms 244.09ms
1.59% 73.83% 96.61% TransformerDecoder 5.91ms 275.21ms
1.44% 32.78% 98.05% Dense 5.38ms 122.18ms
1.15% 1.15% 99.21% LogSoftMax 4.30ms 4.30ms
0.37% 0.37% 99.58% Conv1D 1.40ms 1.40ms
0.16% 10.06% 99.74% TransformerEncoderLayer 0.60ms 37.49ms
0.07% 4.58% 99.81% BestSampler 0.24ms 17.06ms
0.06% 0.46% 99.87% Embeddings 0.23ms 1.72ms
0.03% 0.03% 99.90% Tile 0.11ms 0.11ms
0.02% 10.49% 99.92% WhisperEncoder 0.08ms 39.09ms
0.02% 0.02% 99.94% GELU 0.07ms 0.07ms
Batch size: 2, Execution time: 0.5886483154296875 seconds, Execution time per batch: 0.29432415771484377 seconds
22.61% 27.68% 22.61% Gemm 105.30ms 128.88ms
12.63% 12.63% 35.25% MatMul 58.83ms 58.83ms
11.77% 64.48% 47.02% MultiHeadAttention 54.81ms 300.27ms
7.39% 81.07% 54.41% beam_search 34.43ms 377.52ms
5.68% 5.68% 60.09% Concat 26.44ms 26.44ms
5.63% 22.68% 65.72% dot_product_attention 26.20ms 105.63ms
5.06% 5.06% 70.78% BiasAdd 23.59ms 23.59ms
5.00% 5.00% 75.78% Gather 23.27ms 23.27ms
4.42% 4.42% 80.20% SoftMax 20.59ms 20.59ms
3.61% 3.61% 83.81% LayerNorm 16.79ms 16.79ms
3.60% 3.60% 87.41% TopK 16.77ms 16.77ms
2.54% 2.54% 89.94% Transpose 11.82ms 11.82ms
2.52% 2.52% 92.46% Add 11.71ms 11.71ms
1.95% 1.95% 94.41% Split 9.06ms 9.06ms
1.45% 58.49% 95.86% TransformerDecoderLayer 6.76ms 272.36ms
1.32% 65.23% 97.18% TransformerDecoder 6.16ms 303.75ms
1.11% 28.78% 98.29% Dense 5.15ms 134.03ms
0.94% 0.94% 99.22% LogSoftMax 4.36ms 4.36ms
0.47% 0.47% 99.69% Conv1D 2.17ms 2.17ms
0.11% 17.43% 99.80% TransformerEncoderLayer 0.53ms 81.17ms
0.05% 3.65% 99.85% BestSampler 0.23ms 16.99ms
0.05% 0.36% 99.90% Embeddings 0.22ms 1.67ms
0.02% 0.02% 99.92% GELU 0.10ms 0.10ms
0.02% 0.02% 99.94% Tile 0.09ms 0.09ms
0.02% 17.96% 99.96% WhisperEncoder 0.08ms 83.62ms
Batch size: 4, Execution time: 0.896347900390625 seconds, Execution time per batch: 0.22408697509765624 seconds
19.76% 23.71% 19.76% Gemm 124.92ms 149.93ms
12.89% 12.89% 32.64% MatMul 81.49ms 81.49ms
11.11% 62.85% 43.75% MultiHeadAttention 70.25ms 397.43ms
10.07% 74.06% 53.82% beam_search 63.68ms 468.34ms
7.49% 25.44% 61.32% dot_product_attention 47.38ms 160.87ms
5.97% 5.97% 67.29% Concat 37.76ms 37.76ms
5.06% 5.06% 72.35% SoftMax 32.00ms 32.00ms
4.25% 4.25% 76.59% Gather 26.84ms 26.84ms
3.95% 3.95% 80.55% BiasAdd 25.01ms 25.01ms
3.77% 3.77% 84.32% TopK 23.84ms 23.84ms
3.21% 3.21% 87.53% LayerNorm 20.30ms 20.30ms
2.40% 2.40% 89.92% Transpose 15.16ms 15.16ms
2.09% 2.09% 92.01% Split 13.23ms 13.23ms
1.96% 1.96% 93.98% Add 12.40ms 12.40ms
1.32% 1.32% 95.29% LogSoftMax 8.33ms 8.33ms
1.13% 23.86% 96.42% TransformerEncoderLayer 7.14ms 150.90ms
1.07% 50.46% 97.50% TransformerDecoderLayer 6.79ms 319.10ms
0.98% 56.18% 98.47% TransformerDecoder 6.18ms 355.27ms
0.83% 24.54% 99.31% Dense 5.26ms 155.19ms
0.53% 0.53% 99.83% Conv1D 3.33ms 3.33ms
0.03% 3.80% 99.87% BestSampler 0.22ms 24.06ms
0.03% 0.25% 99.90% Embeddings 0.20ms 1.59ms
0.03% 0.03% 99.92% GELU 0.17ms 0.17ms
0.02% 0.02% 99.94% Tile 0.12ms 0.12ms
0.02% 24.46% 99.96% WhisperEncoder 0.12ms 154.68ms
Batch size: 8, Execution time: 1.4785286865234375 seconds, Execution time per batch: 0.1848160858154297 seconds
15.60% 68.98% 15.60% beam_search 157.47ms 696.20ms
13.92% 16.73% 29.52% Gemm 140.46ms 168.83ms
13.82% 13.82% 43.34% MatMul 139.50ms 139.50ms
12.16% 60.79% 55.51% MultiHeadAttention 122.76ms 613.52ms
8.89% 27.86% 64.40% dot_product_attention 89.77ms 281.16ms
6.55% 6.55% 70.95% Concat 66.08ms 66.08ms
5.25% 5.25% 76.20% Gather 53.04ms 53.04ms
5.14% 5.14% 81.34% SoftMax 51.88ms 51.88ms
3.89% 3.89% 85.24% TopK 39.29ms 39.29ms
2.81% 2.81% 88.05% BiasAdd 28.37ms 28.37ms
2.13% 2.13% 90.17% LayerNorm 21.48ms 21.48ms
1.64% 1.64% 91.81% Transpose 16.55ms 16.55ms
1.59% 1.59% 93.40% Split 16.06ms 16.06ms
1.34% 1.34% 94.74% LogSoftMax 13.49ms 13.49ms
1.33% 28.53% 96.07% TransformerEncoderLayer 13.41ms 287.96ms
1.25% 1.25% 97.32% Add 12.60ms 12.60ms
0.69% 44.64% 98.01% TransformerDecoder 6.96ms 450.53ms
0.68% 40.43% 98.69% TransformerDecoderLayer 6.89ms 408.08ms
0.51% 17.24% 99.20% Dense 5.13ms 173.96ms
0.48% 0.48% 99.67% Conv1D 4.80ms 4.80ms
0.23% 29.29% 99.90% WhisperEncoder 2.32ms 295.64ms
0.03% 0.03% 99.93% GELU 0.29ms 0.29ms
0.02% 3.91% 99.95% BestSampler 0.21ms 39.50ms
0.02% 0.16% 99.97% Embeddings 0.19ms 1.59ms
0.01% 0.01% 99.98% Tile 0.08ms 0.08ms
Batch size: 16, Execution time: 2.668415283203125 seconds, Execution time per batch: 0.1667759552001953 seconds
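For completeness, the generate timings above correspond to a batched call along these lines; the model path, the random features, the fixed prompt, and the beam size are assumptions (a beam search is clearly involved, given the beam_search entries in the profiles):

import numpy as np
import ctranslate2

model = ctranslate2.models.Whisper("whisper-tiny-ct2", device="cuda")  # hypothetical path
features = ctranslate2.StorageView.from_array(
    np.random.rand(16, 80, 3000).astype(np.float32)
)
# One prompt per batch item; these are the standard Whisper task tokens.
prompt = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
results = model.generate(features, [prompt] * 16, beam_size=5)
print(results[0].sequences_ids[0])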
Hi @funboarder13920, are you using torch.profiler or Python's cProfile to obtain the analysis results mentioned above?
I am using the CT2 profiler.
Hello!
Do you have any news on this? I'm building something with faster-whisper and I've just started to notice the time increase proportional to the number of elements in the batch.
Hello @funboarder13920, @dyyzhmm, I'm curious how to use the CT2 profiling, as there isn't much documentation about it. I'd greatly appreciate your help or guidance in this regard. Could you please point me to the right method to invoke the profiling, or an example that does so? That would be really helpful. Thanks in advance.