CTranslate2
faster_whisper batch operation time-consuming
Is this related to CTranslate2?
The following is copied from this . I ran a test of batching in faster-whisper.
However, faster_whisper's batch encode takes time that scales with the number of samples, so it seems that batched encoding in CTranslate2 does not work as expected?
By the way, decode/generate has the same issue for batched operations.
Can you move the input features to the model device before measuring the execution time?
I ran some tests; the inference speed results are suspicious for encode but not for generate. One difference is that encode uses the post method: https://github.com/OpenNMT/CTranslate2/blob/4f8a4f334c59588223b6f1f24b707d7e8d5fe08c/src/models/whisper.cc#L656
The other (generate) uses post_batch: https://github.com/OpenNMT/CTranslate2/blob/4f8a4f334c59588223b6f1f24b707d7e8d5fe08c/src/models/whisper.cc#L667
Here is Guillaume's PR where the encode method was implemented: https://github.com/OpenNMT/CTranslate2/pull/1144. I don't know why Guillaume did not use post_batch, though. Maybe the encode method only relies on CUDA code that should be efficient on a batched tensor, compared to generate, which runs higher-level logic?
I can blindly try to use post_batch.
@vince62s any thoughts on that? Or can you tag someone who could help on CT2?
Can you move the input features to the model device before measuring the execution time?
10 samples.
Before changes
def get_ctranslate2_storage(segment: np.ndarray) -> ctranslate2.StorageView:
    segment = np.ascontiguousarray(segment)
    segment = ctranslate2.StorageView.from_array(segment)
    return segment
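For context, here is a minimal sketch of how this helper is presumably driven for the batched encode being timed below; the model path, the random features, and the encode call are assumptions based on the logged shapes:

import numpy as np
import ctranslate2

# Hypothetical setup: the path and the [10, 128, 3000] feature shape
# mirror the logs below (large-v3 uses 128 mel bins).
model = ctranslate2.models.Whisper("whisper-large-v3-ct2", device="cuda")
features = np.random.rand(10, 128, 3000).astype(np.float32)

storage = get_ctranslate2_storage(features)
encoder_output = model.encode(storage, to_cpu=False)
print(encoder_output.shape)  # expected: [10, 1500, 1280]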
prepare features cost: 0.527921199798584
get_ctranslate2_storage cost 0.013983964920043945 seconds
self.model.encode [10, 128, 3000] cost 5.466844797134399 seconds
encoder_output shape:[10, 1500, 1280]
encode cost: 5.481068134307861
After changes
def get_ctranslate2_storage(segment: np.ndarray) -> ctranslate2.StorageView:
    segment = np.ascontiguousarray(segment)
    segment = torch.from_numpy(segment).float().to("cuda:0")  # added line
    segment = ctranslate2.StorageView.from_array(segment)
    return segment
prepare features cost: 0.47109270095825195
get_ctranslate2_storage cost 0.6899769306182861 seconds
self.model.encode [10, 128, 3000] cost 5.4121253490448 seconds
encoder_output shape:[10, 1500, 1280]
encode cost: 6.102340459823608
Conclusion:
- Moving the features from CPU to cuda:0 (where the model is) in get_ctranslate2_storage changes the copy cost from 0.0139 sec to 0.6899 sec, while encode only goes from 5.4668 sec to 5.4121 sec.
- So the problem is still in the CT2 encode itself, not in copying features between CPU and GPU.
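A side note on methodology: CUDA work can finish asynchronously with respect to the Python caller, so wall-clock timings around GPU calls are more reliable with explicit synchronization. A minimal sketch, assuming torch is available for the sync (whether CT2's encode returns before its kernels complete is itself an assumption worth checking):

import time
import torch

def timed_encode(model, storage):
    # Hypothetical helper: synchronize before and after so the measured
    # interval covers exactly the encode work, not previously queued kernels.
    torch.cuda.synchronize()
    start = time.perf_counter()
    output = model.encode(storage, to_cpu=False)
    torch.cuda.synchronize()
    return output, time.perf_counter() - start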
Here is a more detailed profiling with whisper tiny:
45.09% 77.56% 45.09% MultiHeadAttention 13.81ms
14.73% 14.73% 59.82% BiasAdd 4.51ms
12.79% 27.28% 72.61% dot_product_attention 3.92ms
7.45% 7.45% 80.06% SoftMax 2.28ms
7.04% 7.04% 87.10% MatMul 2.16ms
6.88% 21.61% 93.98% Gemm 2.11ms
2.84% 2.84% 96.83% Conv1D 0.87ms
0.79% 0.79% 97.62% Transpose 0.24ms
0.64% 0.64% 98.26% LayerNorm 0.20ms
0.50% 0.50% 98.76% Split 0.15ms
0.37% 0.37% 99.13% Add 0.11ms
0.27% 96.59% 99.40% TransformerEncoderLayer 0.08ms
0.20% 99.90% 99.60% WhisperEncoder 0.06ms
0.17% 21.78% 99.76% Dense 0.05ms
0.13% 0.13% 99.90% GELU 0.04ms
0.164976 0.107653 0.0736297 ... -0.023933 -0.588876 -0.235716
Batch size: 1, Execution time: 0.035004863739013674 seconds, Execution time per batch: 0.035004863739013674 seconds
35.34% 50.61% 35.34% dot_product_attention 17.50ms
27.67% 91.61% 63.01% MultiHeadAttention 13.70ms
14.65% 15.88% 77.66% Gemm 7.25ms
7.84% 7.84% 85.51% SoftMax 3.88ms
7.42% 7.42% 92.93% MatMul 3.67ms
2.03% 2.03% 94.95% Conv1D 1.00ms
1.25% 97.56% 96.21% TransformerEncoderLayer 0.62ms
1.22% 1.22% 97.43% BiasAdd 0.61ms
0.75% 0.75% 98.18% Transpose 0.37ms
0.59% 0.59% 98.77% LayerNorm 0.29ms
0.44% 0.44% 99.21% Split 0.22ms
0.38% 0.38% 99.59% Add 0.19ms
0.14% 99.95% 99.73% WhisperEncoder 0.07ms
0.12% 15.99% 99.84% Dense 0.06ms
0.11% 0.11% 99.95% GELU 0.06ms
0.164976 0.107653 0.0736297 ... -0.0239328 -0.588876 -0.235715
Batch size: 2, Execution time: 0.058016990661621096 seconds, Execution time per batch: 0.029008495330810548 seconds
43.45% 92.19% 43.45% MultiHeadAttention 40.96ms
18.09% 32.20% 61.55% dot_product_attention 17.05ms
9.94% 9.94% 71.49% Transpose 9.37ms
8.76% 9.83% 80.25% Gemm 8.26ms
7.78% 7.78% 88.03% SoftMax 7.34ms
6.33% 6.33% 94.36% MatMul 5.96ms
2.64% 2.64% 97.00% Conv1D 2.49ms
1.07% 1.07% 98.07% BiasAdd 1.01ms
0.56% 97.08% 98.63% TransformerEncoderLayer 0.53ms
0.44% 0.44% 99.07% LayerNorm 0.41ms
0.35% 0.35% 99.42% Add 0.33ms
0.35% 0.35% 99.77% Split 0.33ms
0.10% 0.10% 99.86% GELU 0.09ms
0.08% 99.99% 99.94% WhisperEncoder 0.07ms
0.05% 9.88% 99.99% Dense 0.04ms
0.164977 0.107653 0.07363 ... -0.0239328 -0.588876 -0.235715
Batch size: 4, Execution time: 0.09641846466064453 seconds, Execution time per batch: 0.024104616165161134 seconds
31.35% 86.33% 31.35% MultiHeadAttention 47.57ms
26.44% 44.62% 57.79% dot_product_attention 40.13ms
10.99% 10.99% 68.78% SoftMax 16.68ms
8.05% 9.28% 76.83% Gemm 12.22ms
7.19% 7.19% 84.02% MatMul 10.91ms
4.52% 97.61% 88.54% TransformerEncoderLayer 6.86ms
2.79% 2.79% 91.33% Transpose 4.24ms
2.39% 2.39% 93.72% Split 3.63ms
2.35% 2.35% 96.08% LayerNorm 3.57ms
2.08% 2.08% 98.16% Conv1D 3.16ms
1.23% 1.23% 99.39% BiasAdd 1.87ms
0.38% 0.38% 99.77% Add 0.57ms
0.11% 0.11% 99.88% GELU 0.17ms
0.07% 99.98% 99.96% WhisperEncoder 0.11ms
0.03% 9.31% 99.98% Dense 0.04ms
0.164976 0.107653 0.0736298 ... -0.0239328 -0.588876 -0.235715
Batch size: 8, Execution time: 0.1561301727294922 seconds, Execution time per batch: 0.019516271591186525 seconds
32.22% 88.50% 32.22% MultiHeadAttention 104.86ms
27.55% 48.49% 59.77% dot_product_attention 89.65ms
11.62% 11.62% 71.39% MatMul 37.82ms
9.33% 9.33% 80.72% SoftMax 30.36ms
6.91% 8.00% 87.62% Gemm 22.49ms
4.16% 97.41% 91.79% TransformerEncoderLayer 13.55ms
1.69% 1.69% 93.48% Transpose 5.51ms
1.69% 1.69% 95.17% Conv1D 5.50ms
1.35% 1.35% 96.52% Split 4.38ms
1.23% 1.23% 97.75% LayerNorm 4.02ms
1.09% 1.09% 98.84% BiasAdd 3.56ms
0.72% 99.99% 99.56% WhisperEncoder 2.33ms
0.32% 0.32% 99.89% Add 1.05ms
0.09% 0.09% 99.98% GELU 0.30ms
0.01% 8.02% 99.99% Dense 0.04ms
0.164977 0.107653 0.07363 ... -0.023933 -0.588876 -0.235715
Batch size: 16, Execution time: 0.3303467102050781 seconds, Execution time per batch: 0.020646669387817382 seconds
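For anyone wanting to reproduce the encode-only numbers above, here is a minimal sketch of the kind of loop that could produce them; the model path and the random features are assumptions (whisper tiny uses 80 mel bins, and repeating one sample also matches the identical output values printed for every batch size):

import time
import numpy as np
import ctranslate2

model = ctranslate2.models.Whisper("whisper-tiny-ct2", device="cuda")  # hypothetical path
single = np.random.rand(1, 80, 3000).astype(np.float32)

for batch_size in (1, 2, 4, 8, 16):
    batch = np.ascontiguousarray(np.repeat(single, batch_size, axis=0))
    storage = ctranslate2.StorageView.from_array(batch)
    start = time.perf_counter()
    model.encode(storage, to_cpu=True)  # to_cpu copies the result back, which also waits for completion
    elapsed = time.perf_counter() - start
    print(f"Batch size: {batch_size}, Execution time: {elapsed} seconds, "
          f"Execution time per batch: {elapsed / batch_size} seconds")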
Here is a profiling of the CT2 generate function, which gathers encoding and decoding. It seems that the decoding in beam_search is quite efficient with batching and represents the main time-consuming task. On the other hand, the encoding in WhisperEncoder does not really benefit from batching. It is possible that there is nothing wrong inside CT2 and that there is a bottleneck in the architecture/computations that prevents batching from being batch_size times faster. There is still a small improvement from encoder batching, just not proportional to the batch size.
Also, there is some optimization done within the replica pool that performs a silent batching, if I understand correctly; this silent batching could explain why CT2 whisper is fast and why manual batching does not provide that much improvement.
I will stop there as I could not find anything obviously wrong; feel free to continue investigating.
28.70% 34.79% 28.70% Gemm 104.68ms 126.92ms
11.91% 11.91% 40.61% MatMul 43.45ms 43.45ms
6.38% 57.72% 46.98% MultiHeadAttention 23.26ms 210.55ms
6.10% 6.10% 53.08% BiasAdd 22.24ms 22.24ms
5.96% 91.42% 59.04% beam_search 21.75ms 333.51ms
5.93% 5.93% 64.97% Concat 21.62ms 21.62ms
4.57% 4.57% 69.54% LayerNorm 16.67ms 16.67ms
4.49% 4.49% 74.03% TopK 16.39ms 16.39ms
4.37% 4.37% 78.40% Gather 15.93ms 15.93ms
3.69% 3.69% 82.09% SoftMax 13.45ms 13.45ms
3.06% 3.06% 85.15% Add 11.18ms 11.18ms
3.04% 18.64% 88.19% dot_product_attention 11.11ms 68.01ms
2.72% 2.72% 90.91% Transpose 9.92ms 9.92ms
2.45% 2.45% 93.37% Split 8.94ms 8.94ms
1.85% 67.98% 95.22% TransformerDecoderLayer 6.76ms 248.01ms
1.63% 76.54% 96.85% TransformerDecoder 5.94ms 279.23ms
1.46% 36.25% 98.31% Dense 5.33ms 132.25ms
1.17% 1.17% 99.48% LogSoftMax 4.28ms 4.28ms
0.23% 0.23% 99.71% Conv1D 0.83ms 0.83ms
0.07% 4.56% 99.78% BestSampler 0.24ms 16.63ms
0.06% 0.48% 99.84% Embeddings 0.23ms 1.76ms
0.05% 7.40% 99.89% TransformerEncoderLayer 0.19ms 27.00ms
0.03% 0.03% 99.92% Tile 0.12ms 0.12ms
0.02% 7.67% 99.94% WhisperEncoder 0.06ms 27.97ms
0.01% 0.01% 99.95% GELU 0.04ms 0.04ms
Batch size: 1, Execution time: 0.4774452209472656 seconds, Execution time per batch: 0.4774452209472656 seconds
25.34% 31.33% 25.34% Gemm 94.46ms 116.80ms
12.93% 12.93% 38.27% MatMul 48.19ms 48.19ms
8.95% 61.25% 47.22% MultiHeadAttention 33.36ms 228.32ms
6.17% 88.31% 53.39% beam_search 23.02ms 329.21ms
5.99% 5.99% 59.38% BiasAdd 22.34ms 22.34ms
4.57% 4.57% 63.96% Concat 17.05ms 17.05ms
4.51% 4.51% 68.47% TopK 16.81ms 16.81ms
4.47% 4.47% 72.94% LayerNorm 16.68ms 16.68ms
4.06% 4.06% 77.00% SoftMax 15.14ms 15.14ms
4.00% 4.00% 81.01% Gather 14.92ms 14.92ms
3.92% 20.91% 84.93% dot_product_attention 14.63ms 77.97ms
3.11% 3.11% 88.04% Add 11.59ms 11.59ms
2.74% 2.74% 90.77% Transpose 10.20ms 10.20ms
2.42% 2.42% 93.19% Split 9.02ms 9.02ms
1.83% 65.48% 95.02% TransformerDecoderLayer 6.82ms 244.09ms
1.59% 73.83% 96.61% TransformerDecoder 5.91ms 275.21ms
1.44% 32.78% 98.05% Dense 5.38ms 122.18ms
1.15% 1.15% 99.21% LogSoftMax 4.30ms 4.30ms
0.37% 0.37% 99.58% Conv1D 1.40ms 1.40ms
0.16% 10.06% 99.74% TransformerEncoderLayer 0.60ms 37.49ms
0.07% 4.58% 99.81% BestSampler 0.24ms 17.06ms
0.06% 0.46% 99.87% Embeddings 0.23ms 1.72ms
0.03% 0.03% 99.90% Tile 0.11ms 0.11ms
0.02% 10.49% 99.92% WhisperEncoder 0.08ms 39.09ms
0.02% 0.02% 99.94% GELU 0.07ms 0.07ms
Batch size: 2, Execution time: 0.5886483154296875 seconds, Execution time per batch: 0.29432415771484377 seconds
22.61% 27.68% 22.61% Gemm 105.30ms 128.88ms
12.63% 12.63% 35.25% MatMul 58.83ms 58.83ms
11.77% 64.48% 47.02% MultiHeadAttention 54.81ms 300.27ms
7.39% 81.07% 54.41% beam_search 34.43ms 377.52ms
5.68% 5.68% 60.09% Concat 26.44ms 26.44ms
5.63% 22.68% 65.72% dot_product_attention 26.20ms 105.63ms
5.06% 5.06% 70.78% BiasAdd 23.59ms 23.59ms
5.00% 5.00% 75.78% Gather 23.27ms 23.27ms
4.42% 4.42% 80.20% SoftMax 20.59ms 20.59ms
3.61% 3.61% 83.81% LayerNorm 16.79ms 16.79ms
3.60% 3.60% 87.41% TopK 16.77ms 16.77ms
2.54% 2.54% 89.94% Transpose 11.82ms 11.82ms
2.52% 2.52% 92.46% Add 11.71ms 11.71ms
1.95% 1.95% 94.41% Split 9.06ms 9.06ms
1.45% 58.49% 95.86% TransformerDecoderLayer 6.76ms 272.36ms
1.32% 65.23% 97.18% TransformerDecoder 6.16ms 303.75ms
1.11% 28.78% 98.29% Dense 5.15ms 134.03ms
0.94% 0.94% 99.22% LogSoftMax 4.36ms 4.36ms
0.47% 0.47% 99.69% Conv1D 2.17ms 2.17ms
0.11% 17.43% 99.80% TransformerEncoderLayer 0.53ms 81.17ms
0.05% 3.65% 99.85% BestSampler 0.23ms 16.99ms
0.05% 0.36% 99.90% Embeddings 0.22ms 1.67ms
0.02% 0.02% 99.92% GELU 0.10ms 0.10ms
0.02% 0.02% 99.94% Tile 0.09ms 0.09ms
0.02% 17.96% 99.96% WhisperEncoder 0.08ms 83.62ms
Batch size: 4, Execution time: 0.896347900390625 seconds, Execution time per batch: 0.22408697509765624 seconds
19.76% 23.71% 19.76% Gemm 124.92ms 149.93ms
12.89% 12.89% 32.64% MatMul 81.49ms 81.49ms
11.11% 62.85% 43.75% MultiHeadAttention 70.25ms 397.43ms
10.07% 74.06% 53.82% beam_search 63.68ms 468.34ms
7.49% 25.44% 61.32% dot_product_attention 47.38ms 160.87ms
5.97% 5.97% 67.29% Concat 37.76ms 37.76ms
5.06% 5.06% 72.35% SoftMax 32.00ms 32.00ms
4.25% 4.25% 76.59% Gather 26.84ms 26.84ms
3.95% 3.95% 80.55% BiasAdd 25.01ms 25.01ms
3.77% 3.77% 84.32% TopK 23.84ms 23.84ms
3.21% 3.21% 87.53% LayerNorm 20.30ms 20.30ms
2.40% 2.40% 89.92% Transpose 15.16ms 15.16ms
2.09% 2.09% 92.01% Split 13.23ms 13.23ms
1.96% 1.96% 93.98% Add 12.40ms 12.40ms
1.32% 1.32% 95.29% LogSoftMax 8.33ms 8.33ms
1.13% 23.86% 96.42% TransformerEncoderLayer 7.14ms 150.90ms
1.07% 50.46% 97.50% TransformerDecoderLayer 6.79ms 319.10ms
0.98% 56.18% 98.47% TransformerDecoder 6.18ms 355.27ms
0.83% 24.54% 99.31% Dense 5.26ms 155.19ms
0.53% 0.53% 99.83% Conv1D 3.33ms 3.33ms
0.03% 3.80% 99.87% BestSampler 0.22ms 24.06ms
0.03% 0.25% 99.90% Embeddings 0.20ms 1.59ms
0.03% 0.03% 99.92% GELU 0.17ms 0.17ms
0.02% 0.02% 99.94% Tile 0.12ms 0.12ms
0.02% 24.46% 99.96% WhisperEncoder 0.12ms 154.68ms
Batch size: 8, Execution time: 1.4785286865234375 seconds, Execution time per batch: 0.1848160858154297 seconds
15.60% 68.98% 15.60% beam_search 157.47ms 696.20ms
13.92% 16.73% 29.52% Gemm 140.46ms 168.83ms
13.82% 13.82% 43.34% MatMul 139.50ms 139.50ms
12.16% 60.79% 55.51% MultiHeadAttention 122.76ms 613.52ms
8.89% 27.86% 64.40% dot_product_attention 89.77ms 281.16ms
6.55% 6.55% 70.95% Concat 66.08ms 66.08ms
5.25% 5.25% 76.20% Gather 53.04ms 53.04ms
5.14% 5.14% 81.34% SoftMax 51.88ms 51.88ms
3.89% 3.89% 85.24% TopK 39.29ms 39.29ms
2.81% 2.81% 88.05% BiasAdd 28.37ms 28.37ms
2.13% 2.13% 90.17% LayerNorm 21.48ms 21.48ms
1.64% 1.64% 91.81% Transpose 16.55ms 16.55ms
1.59% 1.59% 93.40% Split 16.06ms 16.06ms
1.34% 1.34% 94.74% LogSoftMax 13.49ms 13.49ms
1.33% 28.53% 96.07% TransformerEncoderLayer 13.41ms 287.96ms
1.25% 1.25% 97.32% Add 12.60ms 12.60ms
0.69% 44.64% 98.01% TransformerDecoder 6.96ms 450.53ms
0.68% 40.43% 98.69% TransformerDecoderLayer 6.89ms 408.08ms
0.51% 17.24% 99.20% Dense 5.13ms 173.96ms
0.48% 0.48% 99.67% Conv1D 4.80ms 4.80ms
0.23% 29.29% 99.90% WhisperEncoder 2.32ms 295.64ms
0.03% 0.03% 99.93% GELU 0.29ms 0.29ms
0.02% 3.91% 99.95% BestSampler 0.21ms 39.50ms
0.02% 0.16% 99.97% Embeddings 0.19ms 1.59ms
0.01% 0.01% 99.98% Tile 0.08ms 0.08ms
Batch size: 16, Execution time: 2.668415283203125 seconds, Execution time per batch: 0.1667759552001953 seconds
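For completeness, the generate timings above correspond to a batched call along these lines; the model path, the random features, the fixed prompt, and the beam size are assumptions (a beam search is clearly involved, given the beam_search entries in the profiles):

import numpy as np
import ctranslate2

model = ctranslate2.models.Whisper("whisper-tiny-ct2", device="cuda")  # hypothetical path
features = ctranslate2.StorageView.from_array(
    np.random.rand(16, 80, 3000).astype(np.float32)
)
# One prompt per batch item; these are the standard Whisper task tokens.
prompt = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
results = model.generate(features, [prompt] * 16, beam_size=5)
print(results[0].sequences_ids[0])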
Hi @funboarder13920, are you using torch.profiler or Python's cProfile to obtain the analysis results mentioned above?
I am using the CT2 profiler.
Hello!
Do you have any news on this? I'm building something with faster-whisper and I've just started to notice the time increase proportional to the number of elements in the batch.
Hello @funboarder13920, @dyyzhmm, I'm curious how to use the CT2 profiling, as there isn't much documentation about it. I'd greatly appreciate your help or guidance in this regard. Could you please point me to the right method to invoke the profiling, or an example that does so? That would be really helpful. Thanks in advance.