TensorRT-LLM
BERT model can't be converted
System Info
CUDA 12.2
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
When I run the BERT example with the following command: `nohup python3 build.py --dtype=float16 --log_level=verbose > t2.log 2>&1 &`
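For reference, the same build can be run in the foreground so that any fatal error and the exit status are visible immediately instead of only in the redirected log. This is just the command above without `nohup`; the `tee` and exit-status check are illustrative additions, not part of the original report:

```sh
# Same build invocation as above, run in the foreground; output is still
# saved to t2.log.
set -o pipefail
python3 build.py --dtype=float16 --log_level=verbose 2>&1 | tee t2.log

# A non-zero status (e.g. 137 = killed by SIGKILL) often indicates the
# process was killed for running out of memory rather than a TensorRT error.
echo "exit status: $?"
```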
Expected behavior
The model should be converted to an ONNX model; right now the output folder is empty.
actual behavior
[01/26/2024-18:54:41] [TRT] [V] After Myelin optimization: 1 layers
[01/26/2024-18:54:41] [TRT] [V] Applying ScaleNodes fusions.
[01/26/2024-18:54:41] [TRT] [V] After scale fusion: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After dupe layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After final dead-layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After tensor merging: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After vertical fusions: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After dupe layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [W] [RemoveDeadLayers] Input Tensor input_lengths is unused or used only at compile-time, but is not being removed.
[01/26/2024-18:54:41] [TRT] [V] After final dead-layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After tensor merging: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After slice removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After concat removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] Trying to split Reshape and strided tensor
[01/26/2024-18:54:41] [TRT] [V] Graph optimization time: 0.0770116 seconds.
[01/26/2024-18:54:41] [TRT] [V] Building graph using backend strategy 2
[01/26/2024-18:54:41] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/26/2024-18:54:41] [TRT] [V] Constructing optimization profile number 0 [1/1].
[01/26/2024-18:54:41] [TRT] [V] Applying generic optimizations to the graph for inference.
[01/26/2024-18:54:42] [TRT] [V] Reserving memory for host IO tensors. Host: 0 bytes
[01/26/2024-18:54:42] [TRT] [V] =============== Computing costs for {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]}
[01/26/2024-18:54:42] [TRT] [V] *************** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Float((* 1024 input_len),1024,1) ***************
[01/26/2024-18:54:42] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:44] [TRT] [V] [MemUsageChange] Subgraph create: CPU +1415, GPU +1700, now: CPU 5966, GPU 8997 (MiB)
[01/26/2024-18:54:46] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [shape.cpp:verify_output_type:1274] Mismatched type for tensor BertModel/layers/0/attention/qkv/MATRIX_MULTIPLY_0_output_0', f32 vs. expected type:f16.
[01/26/2024-18:54:46] [TRT] [V] {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023]) profiling completed in 4.61707 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[01/26/2024-18:54:46] [TRT] [V] *************** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Half((* 1024 input_len),1024,1) ***************
[01/26/2024-18:54:46] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:46] [TRT] [V] [MemUsageChange] Subgraph create: CPU +38, GPU +0, now: CPU 4608, GPU 8868 (MiB)
additional notes
None
Could you share the full log? I don't see the error log here. If the program crashes directly, it might be that building the engine requires more RAM.
I captured a screenshot; the GPU memory is full.
There is no error; the log file ends with:
[01/26/2024-18:54:44] [TRT] [V] [MemUsageChange] Subgraph create: CPU +1415, GPU +1700, now: CPU 5966, GPU 8997 (MiB)
[01/26/2024-18:54:46] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [shape.cpp:verify_output_type:1274] Mismatched type for tensor BertModel/layers/0/attention/qkv/MATRIX_MULTIPLY_0_output_0', f32 vs. expected type:f16.
[01/26/2024-18:54:46] [TRT] [V] {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023]) profiling completed in 4.61707 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[01/26/2024-18:54:46] [TRT] [V] *************** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Half((* 1024 input_len),1024,1) ***************
[01/26/2024-18:54:46] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:46] [TRT] [V] [MemUsageChange] Subgraph create: CPU +38, GPU +0, now: CPU 4608, GPU 8868 (MiB)
Then the program exits.
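As an aside, a simple way to confirm whether GPU or host memory is actually being exhausted while the build runs (illustrative only, not part of the original report):

```sh
# Poll GPU memory once per second while build.py is running in another shell.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1

# Host RAM can be watched similarly, since engine building also uses
# significant CPU memory.
free -m -s 1
```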
I think you are right. Is there any way to reduce memory usage?
Oh, the last part is:
We don't have a way to reduce the RAM usage now.
@whk6688 Try reducing `max_batch_size` in `build.py`. I set it to 128 and it fit into 4 GB of RAM.
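A minimal sketch of that suggestion, assuming the BERT example's `build.py` exposes a `--max_batch_size` option (if it does not, the corresponding default value inside `build.py` can be lowered instead):

```sh
# Rebuild with a smaller maximum batch size to reduce peak memory while the
# engine is being built. --max_batch_size is an assumption about the example's
# CLI; otherwise edit the default in build.py directly.
python3 build.py --dtype=float16 --log_level=verbose --max_batch_size=128
```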