fastertransformer_backend
Can't run multi-node GPTJ inference
I followed the tutorial provided here. I am able to run GPT-J 6B on a single node. However, when I try the multi-node inference example with the following command on two nodes:
WORKSPACE="/workspace" # the dir you build the docker
CONTAINER_VERSION=22.07
IMAGE=bdhu/triton_with_ft:22.07
CMD="/opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/all_models/gptj"
srun -N 2 -n 2 --mpi=pmi2 -o inference_server.log \
--container-mounts /home/edwardhu/my-script/triton:${WORKSPACE} \
--container-name multi-node-ft-triton \
--container-image ${IMAGE} \
bash -c "$CMD"
It shows the following error in the log file:
...
E1011 02:30:34.325725 21361 libfastertransformer.cc:168] Invalid configuration argument 'is_half': stoi
E1011 02:30:34.325737 21361 libfastertransformer.cc:168] Invalid configuration argument 'max_seq_len': stoi
E1011 02:30:34.325742 21361 libfastertransformer.cc:168] Invalid configuration argument 'head_num': stoi
E1011 02:30:34.325747 21361 libfastertransformer.cc:168] Invalid configuration argument 'size_per_head': stoi
E1011 02:30:34.325752 21361 libfastertransformer.cc:168] Invalid configuration argument 'inter_size': stoi
E1011 02:30:34.325757 21361 libfastertransformer.cc:168] Invalid configuration argument 'decoder_layers': stoi
E1011 02:30:34.325761 21361 libfastertransformer.cc:168] Invalid configuration argument 'vocab_size': stoi
E1011 02:30:34.325766 21361 libfastertransformer.cc:168] Invalid configuration argument 'rotary_embedding': stoi
E1011 02:30:34.325770 21361 libfastertransformer.cc:168] Invalid configuration argument 'start_id': stoi
E1011 02:30:34.325775 21361 libfastertransformer.cc:168] Invalid configuration argument 'end_id': stoi
W1011 02:30:34.325785 21361 libfastertransformer.cc:334] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
I1011 02:30:34.327796 24630 libfastertransformer.cc:420] Before Loading Weights:
I1011 02:30:34.328497 21361 libfastertransformer.cc:1320] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (device 0)
W1011 02:30:34.328515 21361 libfastertransformer.cc:453] Faster transformer model instance is created at GPU '0'
W1011 02:30:34.328518 21361 libfastertransformer.cc:459] Model name gpt-j-6b
W1011 02:30:34.328540 21361 libfastertransformer.cc:578] Get input name: input_ids, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328543 21361 libfastertransformer.cc:578] Get input name: start_id, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328546 21361 libfastertransformer.cc:578] Get input name: end_id, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328548 21361 libfastertransformer.cc:578] Get input name: input_lengths, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328551 21361 libfastertransformer.cc:578] Get input name: request_output_len, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328554 21361 libfastertransformer.cc:578] Get input name: runtime_top_k, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328556 21361 libfastertransformer.cc:578] Get input name: runtime_top_p, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328559 21361 libfastertransformer.cc:578] Get input name: beam_search_diversity_rate, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328561 21361 libfastertransformer.cc:578] Get input name: temperature, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328572 21361 libfastertransformer.cc:578] Get input name: len_penalty, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328574 21361 libfastertransformer.cc:578] Get input name: repetition_penalty, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328577 21361 libfastertransformer.cc:578] Get input name: random_seed, type: TYPE_UINT64, shape: [1]
W1011 02:30:34.328579 21361 libfastertransformer.cc:578] Get input name: is_return_log_probs, type: TYPE_BOOL, shape: [1]
W1011 02:30:34.328581 21361 libfastertransformer.cc:578] Get input name: beam_width, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328584 21361 libfastertransformer.cc:578] Get input name: bad_words_list, type: TYPE_INT32, shape: [2, -1]
W1011 02:30:34.328587 21361 libfastertransformer.cc:578] Get input name: stop_words_list, type: TYPE_INT32, shape: [2, -1]
W1011 02:30:34.328590 21361 libfastertransformer.cc:578] Get input name: prompt_learning_task_name_ids, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328594 21361 libfastertransformer.cc:620] Get output name: output_ids, type: TYPE_UINT32, shape: [-1, -1]
W1011 02:30:34.328597 21361 libfastertransformer.cc:620] Get output name: sequence_length, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328599 21361 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
W1011 02:30:34.328602 21361 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
after allocation : free: 15.36 GB, total: 15.78 GB, used: 0.42 GB
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.input_layernorm.bias.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.input_layernorm.weight.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.attention.query_key_value.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.attention.dense.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_h_to_4h.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_h_to_4h.bias.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_4h_to_h.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_4h_to_h.bias.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.input_layernorm.bias.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.input_layernorm.weight.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.attention.query_key_value.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.attention.dense.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.mlp.dense_h_to_4h.weight.0.bin
.... many similar warnings
after allocation : free: 13.27 GB, total: 15.78 GB, used: 2.51 GB
W1011 02:30:46.476398 24630 libfastertransformer.cc:478] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
[node5:21361] *** An error occurred in MPI_Bcast
[node5:21361] *** reported by process [4915207,1]
[node5:21361] *** on communicator MPI_COMM_WORLD
[node5:21361] *** MPI_ERR_TRUNCATE: message truncated
[node5:21361] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node5:21361] *** and potentially your MPI job)
slurmstepd: error: *** STEP 75.7 ON node4 CANCELLED AT 2022-10-10T21:30:46 ***
Are there any hints on how to resolve this issue? Thanks!
@BDHU can you try to run a simple MPI example before starting the Triton server, to make sure MPI works as expected?
It could be the case that pmi2 doesn't work with your Slurm system, or you could try pmix. You can check this by running:
srun --mpi=list
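For instance, a quick cross-node check (just a sketch, assuming the same container image and the pmix plugin as your Triton launch) could be:
# Hypothetical sanity check: each of the two ranks prints its rank and hostname.
srun -N 2 -n 2 --mpi=pmix \
     --container-image ${IMAGE} \
     bash -c 'echo "rank ${SLURM_PROCID} on $(hostname)"'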
Moreover, your Triton server seems to be based on an older version of ft_triton_backend (Invalid configuration argument 'is_half': stoi). You may need to update the triton_backend.so by setting CMD="cp $WORKSPACE/fastertransformer_backend/build/libtriton_fastertransformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/fastertransformer; /opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/all_models/gptj"
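Put together, the relaunch could look roughly like this (a sketch only; it assumes the rebuilt fastertransformer_backend sits under $WORKSPACE via your existing container mount):
# Hypothetical relaunch: copy the rebuilt backend libraries into the Triton
# backends directory on every node, then start the server.
CMD="cp $WORKSPACE/fastertransformer_backend/build/libtriton_fastertransformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/fastertransformer; /opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/all_models/gptj"
srun -N 2 -n 2 --mpi=pmix -o inference_server.log \
     --container-mounts /home/edwardhu/my-script/triton:${WORKSPACE} \
     --container-name multi-node-ft-triton \
     --container-image ${IMAGE} \
     bash -c "$CMD"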
@PerkzZheng Thanks for the reply. I did try using pmix, and I used the ring_c.c example in the ompi repo. I am able to run that program successfully on two nodes with srun --mpi=pmix -N 2 ./a.out. Running srun --mpi=list shows:
MPI plugin types are...
pmix
cray_shasta
pmi2
none
specific pmix plugin versions available: pmix_v4
However, even with pmix the following error persists:
[node5:45239] *** An error occurred in MPI_Bcast
[node5:45239] *** reported by process [570632255,1]
[node5:45239] *** on communicator MPI_COMM_WORLD
[node5:45239] *** MPI_ERR_TRUNCATE: message truncated
[node5:45239] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node5:45239] *** and potentially your MPI job)
slurmstepd: error: *** STEP 173.0 ON node4 CANCELLED AT 2022-10-13T23:00:52 ***
@BDHU can you share the config.pbtxt and config.ini (generated when converting the checkpoint)? Both files are in the /${workspace}/all_models/gptj/fastertransformer directory.
Here is the config.pbtxt:
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt-j-6b"
max_batch_size: 1024
model_transaction_policy {
decoupled: False
}
input [
{
name: "input_ids"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "start_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "is_return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "prompt_learning_task_name_ids"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_UINT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters {
key: "tensor_para_size"
value: {
string_value: "4"
}
}
parameters {
key: "pipeline_para_size"
value: {
string_value: "2"
}
}
parameters {
key: "data_type"
value: {
string_value: "fp16"
}
}
parameters {
key: "model_type"
value: {
string_value: "GPT-J"
}
}
parameters {
key: "model_checkpoint_path"
value: {
#string_value: "/data/models/GPT-J/EleutherAI/gptj-model/c-model/4-gpu/"
string_value: "/workspace/all_models/gptj/fastertransformer/1/4-gpu/"
}
}
parameters {
key: "enable_custom_all_reduce"
value: {
string_value: "0"
}
}
And here is the config.ini:
[gptj]
model_name = gptj-6B
head_num = 16
size_per_head = 256
inter_size = 16384
num_layer = 28
rotary_embedding = 64
vocab_size = 50400
start_id = 50256
end_id = 50256
weight_data_type = fp32
So you have 4 GPUs on each node?
That's correct, 4 V100s on each node.
Can you share the full logs (attach them)? I don't see any noticeable clues in the log above.
Here's the log from running srun:
inference_server.log
I've also attached the slurmd.log file from both nodes just in case:
Per your instruction on rebuilding fastertransformer_backend, it seems that after rebuilding the error is now related to NCCL:
I suspect it has something to do with the way the two nodes are connected. Since I only use TCP between these two nodes, maybe NCCL is not compatible with TCP?
I also tried to change the network interface for NCCL using export NCCL_SOCKET_IFNAME=eno4 (the TCP connection), which produces a new error:
I guess the problem has something to do with the cross-node communication? Perhaps there is a way to specify that in config.pbtxt?
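For reference, this is roughly how I am setting the NCCL variables when retrying (a sketch; only NCCL_SOCKET_IFNAME=eno4 is what I actually changed, and NCCL_IB_DISABLE=1 is just a guess for the TCP-only link):
# NCCL settings being tried for a TCP-only link between the nodes.
# Only NCCL_SOCKET_IFNAME=eno4 comes from the run above; the other line is a guess.
export NCCL_SOCKET_IFNAME=eno4   # NIC carrying the inter-node TCP traffic
export NCCL_IB_DISABLE=1         # skip InfiniBand/RoCE and force the socket transport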
You can try to add NCCL_DEBUG=INFO, which will give further information, and run nccl-tests to make sure NCCL works as expected. It could be a problem when NCCL tries to create the communicator among the nodes.
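For example, something along these lines (a sketch only; it assumes nccl-tests has been built with MPI support inside the same container image, and the path /workspace/nccl-tests is illustrative):
# Hypothetical nccl-tests run across both nodes (4 GPUs per node, 8 ranks total).
# Requires nccl-tests (github.com/NVIDIA/nccl-tests) built with MPI=1 in the container.
export NCCL_DEBUG=INFO
srun -N 2 -n 8 --ntasks-per-node=4 --mpi=pmix \
     --container-image ${IMAGE} \
     --container-mounts /home/edwardhu/my-script/triton:/workspace \
     /workspace/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1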