TensorRT-LLM
BERT model can't be converted
System Info
CUDA 12.2
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
When I run the BERT example with the following command: `nohup python3 build.py --dtype=float16 --log_level=verbose > t2.log 2>&1 &`
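For reference, the same build can be run in the foreground so that any fatal error and the exit status are visible immediately instead of only in the redirected log. This is just the command above without `nohup`; the `tee` and exit-status check are illustrative additions, not part of the original report:

```sh
# Same build invocation as above, run in the foreground; output is still
# saved to t2.log.
set -o pipefail
python3 build.py --dtype=float16 --log_level=verbose 2>&1 | tee t2.log

# A non-zero status (e.g. 137 = killed by SIGKILL) often indicates the
# process was killed for running out of memory rather than a TensorRT error.
echo "exit status: $?"
```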
Expected behavior
The model should be converted to an ONNX model; right now the output folder is empty.
actual behavior
[01/26/2024-18:54:41] [TRT] [V] After Myelin optimization: 1 layers
[01/26/2024-18:54:41] [TRT] [V] Applying ScaleNodes fusions.
[01/26/2024-18:54:41] [TRT] [V] After scale fusion: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After dupe layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After final dead-layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After tensor merging: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After vertical fusions: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After dupe layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [W] [RemoveDeadLayers] Input Tensor input_lengths is unused or used only at compile-time, but is not being removed.
[01/26/2024-18:54:41] [TRT] [V] After final dead-layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After tensor merging: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After slice removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After concat removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] Trying to split Reshape and strided tensor
[01/26/2024-18:54:41] [TRT] [V] Graph optimization time: 0.0770116 seconds.
[01/26/2024-18:54:41] [TRT] [V] Building graph using backend strategy 2
[01/26/2024-18:54:41] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/26/2024-18:54:41] [TRT] [V] Constructing optimization profile number 0 [1/1].
[01/26/2024-18:54:41] [TRT] [V] Applying generic optimizations to the graph for inference.
[01/26/2024-18:54:42] [TRT] [V] Reserving memory for host IO tensors. Host: 0 bytes
[01/26/2024-18:54:42] [TRT] [V] =============== Computing costs for {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]}
[01/26/2024-18:54:42] [TRT] [V] *************** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Float((* 1024 input_len),1024,1) ***************
[01/26/2024-18:54:42] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:44] [TRT] [V] [MemUsageChange] Subgraph create: CPU +1415, GPU +1700, now: CPU 5966, GPU 8997 (MiB)
[01/26/2024-18:54:46] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [shape.cpp:verify_output_type:1274] Mismatched type for tensor BertModel/layers/0/attention/qkv/MATRIX_MULTIPLY_0_output_0', f32 vs. expected type:f16.
[01/26/2024-18:54:46] [TRT] [V] {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023]) profiling completed in 4.61707 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[01/26/2024-18:54:46] [TRT] [V] *************** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Half((* 1024 input_len),1024,1) ***************
[01/26/2024-18:54:46] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:46] [TRT] [V] [MemUsageChange] Subgraph create: CPU +38, GPU +0, now: CPU 4608, GPU 8868 (MiB)
additional notes
None
Could you share the full log? I don't see the error log here. If the program crashes directly, it might be that building the engine requires more RAM.
I captured a screenshot; the GPU memory is full.
There is no error; the log file ends with:
[01/26/2024-18:54:44] [TRT] [V] [MemUsageChange] Subgraph create: CPU +1415, GPU +1700, now: CPU 5966, GPU 8997 (MiB)
[01/26/2024-18:54:46] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [shape.cpp:verify_output_type:1274] Mismatched type for tensor BertModel/layers/0/attention/qkv/MATRIX_MULTIPLY_0_output_0', f32 vs. expected type:f16.
[01/26/2024-18:54:46] [TRT] [V] {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023]) profiling completed in 4.61707 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[01/26/2024-18:54:46] [TRT] [V] *************** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Half((* 1024 input_len),1024,1) ***************
[01/26/2024-18:54:46] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:46] [TRT] [V] [MemUsageChange] Subgraph create: CPU +38, GPU +0, now: CPU 4608, GPU 8868 (MiB)
Then the program exits.
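As an aside, a simple way to confirm whether GPU or host memory is actually being exhausted while the build runs (illustrative only, not part of the original report):

```sh
# Poll GPU memory once per second while build.py is running in another shell.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1

# Host RAM can be watched similarly, since engine building also uses
# significant CPU memory.
free -m -s 1
```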
I think you are right. Is there any way to reduce memory usage?
Oh, the last part is:
We don't have a way to reduce the RAM usage now.
@whk6688 Try reducing `max_batch_size` in `build.py`. I set it to 128 and it fit into 4 GB of RAM.
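A minimal sketch of that suggestion, assuming the BERT example's `build.py` exposes a `--max_batch_size` option (if it does not, the corresponding default value inside `build.py` can be lowered instead):

```sh
# Rebuild with a smaller maximum batch size to reduce peak memory while the
# engine is being built. --max_batch_size is an assumption about the example's
# CLI; otherwise edit the default in build.py directly.
python3 build.py --dtype=float16 --log_level=verbose --max_batch_size=128
```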