fastertransformer_backend
Can't run multi-node GPTJ inference
I followed the tutorial provided here. I am able to run GPT-J 6B on a single node. However, when I try the multi-node inference example with the following command on two nodes:
WORKSPACE="/workspace" # the dir you build the docker
CONTAINER_VERSION=22.07
IMAGE=bdhu/triton_with_ft:22.07
CMD="/opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/all_models/gptj"
srun -N 2 -n 2 --mpi=pmi2 -o inference_server.log \
--container-mounts /home/edwardhu/my-script/triton:${WORKSPACE} \
--container-name multi-node-ft-triton \
--container-image ${IMAGE} \
bash -c "$CMD"
It shows the following error in the log file:
...
E1011 02:30:34.325725 21361 libfastertransformer.cc:168] Invalid configuration argument 'is_half': stoi
E1011 02:30:34.325737 21361 libfastertransformer.cc:168] Invalid configuration argument 'max_seq_len': stoi
E1011 02:30:34.325742 21361 libfastertransformer.cc:168] Invalid configuration argument 'head_num': stoi
E1011 02:30:34.325747 21361 libfastertransformer.cc:168] Invalid configuration argument 'size_per_head': stoi
E1011 02:30:34.325752 21361 libfastertransformer.cc:168] Invalid configuration argument 'inter_size': stoi
E1011 02:30:34.325757 21361 libfastertransformer.cc:168] Invalid configuration argument 'decoder_layers': stoi
E1011 02:30:34.325761 21361 libfastertransformer.cc:168] Invalid configuration argument 'vocab_size': stoi
E1011 02:30:34.325766 21361 libfastertransformer.cc:168] Invalid configuration argument 'rotary_embedding': stoi
E1011 02:30:34.325770 21361 libfastertransformer.cc:168] Invalid configuration argument 'start_id': stoi
E1011 02:30:34.325775 21361 libfastertransformer.cc:168] Invalid configuration argument 'end_id': stoi
W1011 02:30:34.325785 21361 libfastertransformer.cc:334] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
I1011 02:30:34.327796 24630 libfastertransformer.cc:420] Before Loading Weights:
I1011 02:30:34.328497 21361 libfastertransformer.cc:1320] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (device 0)
W1011 02:30:34.328515 21361 libfastertransformer.cc:453] Faster transformer model instance is created at GPU '0'
W1011 02:30:34.328518 21361 libfastertransformer.cc:459] Model name gpt-j-6b
W1011 02:30:34.328540 21361 libfastertransformer.cc:578] Get input name: input_ids, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328543 21361 libfastertransformer.cc:578] Get input name: start_id, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328546 21361 libfastertransformer.cc:578] Get input name: end_id, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328548 21361 libfastertransformer.cc:578] Get input name: input_lengths, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328551 21361 libfastertransformer.cc:578] Get input name: request_output_len, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328554 21361 libfastertransformer.cc:578] Get input name: runtime_top_k, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328556 21361 libfastertransformer.cc:578] Get input name: runtime_top_p, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328559 21361 libfastertransformer.cc:578] Get input name: beam_search_diversity_rate, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328561 21361 libfastertransformer.cc:578] Get input name: temperature, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328572 21361 libfastertransformer.cc:578] Get input name: len_penalty, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328574 21361 libfastertransformer.cc:578] Get input name: repetition_penalty, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328577 21361 libfastertransformer.cc:578] Get input name: random_seed, type: TYPE_UINT64, shape: [1]
W1011 02:30:34.328579 21361 libfastertransformer.cc:578] Get input name: is_return_log_probs, type: TYPE_BOOL, shape: [1]
W1011 02:30:34.328581 21361 libfastertransformer.cc:578] Get input name: beam_width, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328584 21361 libfastertransformer.cc:578] Get input name: bad_words_list, type: TYPE_INT32, shape: [2, -1]
W1011 02:30:34.328587 21361 libfastertransformer.cc:578] Get input name: stop_words_list, type: TYPE_INT32, shape: [2, -1]
W1011 02:30:34.328590 21361 libfastertransformer.cc:578] Get input name: prompt_learning_task_name_ids, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328594 21361 libfastertransformer.cc:620] Get output name: output_ids, type: TYPE_UINT32, shape: [-1, -1]
W1011 02:30:34.328597 21361 libfastertransformer.cc:620] Get output name: sequence_length, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328599 21361 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
W1011 02:30:34.328602 21361 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
after allocation : free: 15.36 GB, total: 15.78 GB, used: 0.42 GB
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.input_layernorm.bias.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.input_layernorm.weight.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.attention.query_key_value.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.attention.dense.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_h_to_4h.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_h_to_4h.bias.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_4h_to_h.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_4h_to_h.bias.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.input_layernorm.bias.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.input_layernorm.weight.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.attention.query_key_value.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.attention.dense.weight.0.bin
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.mlp.dense_h_to_4h.weight.0.bin
.... many similar warnings
after allocation : free: 13.27 GB, total: 15.78 GB, used: 2.51 GB
W1011 02:30:46.476398 24630 libfastertransformer.cc:478] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
[node5:21361] *** An error occurred in MPI_Bcast
[node5:21361] *** reported by process [4915207,1]
[node5:21361] *** on communicator MPI_COMM_WORLD
[node5:21361] *** MPI_ERR_TRUNCATE: message truncated
[node5:21361] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node5:21361] *** and potentially your MPI job)
slurmstepd: error: *** STEP 75.7 ON node4 CANCELLED AT 2022-10-10T21:30:46 ***
Are there any hints on how to resolve this issue? Thanks!
@BDHU can you try to run a simple MPI example before starting the Triton server, to make sure MPI works as expected?
It could be the case that pmi2 doesn't work with your Slurm system, or you could try pmix. You can check this by running:
srun --mpi=list
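For instance, a quick cross-node check (just a sketch, assuming the same container image and the pmix plugin as your Triton launch) could be:
# Hypothetical sanity check: each of the two ranks prints its rank and hostname.
srun -N 2 -n 2 --mpi=pmix \
     --container-image ${IMAGE} \
     bash -c 'echo "rank ${SLURM_PROCID} on $(hostname)"'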
Moreover, your Triton server seems to be based on an older version of ft_triton_backend (Invalid configuration argument 'is_half': stoi). You may need to update the triton_backend.so by setting CMD="cp $WORKSPACE/fastertransformer_backend/build/libtriton_fastertransformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/fastertransformer; /opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/all_models/gptj"
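Put together, the relaunch could look roughly like this (a sketch only; it assumes the rebuilt fastertransformer_backend sits under $WORKSPACE via your existing container mount):
# Hypothetical relaunch: copy the rebuilt backend libraries into the Triton
# backends directory on every node, then start the server.
CMD="cp $WORKSPACE/fastertransformer_backend/build/libtriton_fastertransformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/fastertransformer; /opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/all_models/gptj"
srun -N 2 -n 2 --mpi=pmix -o inference_server.log \
     --container-mounts /home/edwardhu/my-script/triton:${WORKSPACE} \
     --container-name multi-node-ft-triton \
     --container-image ${IMAGE} \
     bash -c "$CMD"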
@PerkzZheng Thanks for the reply. I did try using pmix, and I used the ring_c.c example in the ompi repo. I am able to run that program successfully on two nodes with srun --mpi=pmix -N 2 ./a.out. Running srun --mpi=list shows:
MPI plugin types are...
pmix
cray_shasta
pmi2
none
specific pmix plugin versions available: pmix_v4
However, even with pmix the following error persists:
[node5:45239] *** An error occurred in MPI_Bcast
[node5:45239] *** reported by process [570632255,1]
[node5:45239] *** on communicator MPI_COMM_WORLD
[node5:45239] *** MPI_ERR_TRUNCATE: message truncated
[node5:45239] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node5:45239] *** and potentially your MPI job)
slurmstepd: error: *** STEP 173.0 ON node4 CANCELLED AT 2022-10-13T23:00:52 ***
@BDHU can you share the config.pbtxt and config.ini (generated when converting the checkpoint)? Both files are in the /${workspace}/all_models/gptj/fastertransformer directory.
Here is the config.pbtxt:
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt-j-6b"
max_batch_size: 1024
model_transaction_policy {
decoupled: False
}
input [
{
name: "input_ids"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "start_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "is_return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "prompt_learning_task_name_ids"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_UINT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters {
key: "tensor_para_size"
value: {
string_value: "4"
}
}
parameters {
key: "pipeline_para_size"
value: {
string_value: "2"
}
}
parameters {
key: "data_type"
value: {
string_value: "fp16"
}
}
parameters {
key: "model_type"
value: {
string_value: "GPT-J"
}
}
parameters {
key: "model_checkpoint_path"
value: {
#string_value: "/data/models/GPT-J/EleutherAI/gptj-model/c-model/4-gpu/"
string_value: "/workspace/all_models/gptj/fastertransformer/1/4-gpu/"
}
}
parameters {
key: "enable_custom_all_reduce"
value: {
string_value: "0"
}
}
And here is the config.ini:
[gptj]
model_name = gptj-6B
head_num = 16
size_per_head = 256
inter_size = 16384
num_layer = 28
rotary_embedding = 64
vocab_size = 50400
start_id = 50256
end_id = 50256
weight_data_type = fp32
So you have 4 GPUs on each node?
That's correct, 4 V100s on each node.
Can you share the full logs (attach them)? I don't see any noticeable clues in the log above.
Here's the log from running srun:
inference_server.log
I've also attached the slurmd.log file from both nodes just in case:
Per your instruction on rebuilding fastertransformer_backend, it seems that after rebuilding the error is now related to NCCL:
I suspect it has something to do with the way the two nodes are connected. Since I only use TCP between these two nodes, maybe NCCL is not compatible with TCP?
I also tried to change the network interface for NCCL using export NCCL_SOCKET_IFNAME=eno4 (the TCP connection), which produces a new error:
I guess the problem has something to do with the cross-node communication? Perhaps there is a way to specify that in config.pbtxt?
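For reference, this is roughly how I am setting the NCCL variables when retrying (a sketch; only NCCL_SOCKET_IFNAME=eno4 is what I actually changed, and NCCL_IB_DISABLE=1 is just a guess for the TCP-only link):
# NCCL settings being tried for a TCP-only link between the nodes.
# Only NCCL_SOCKET_IFNAME=eno4 comes from the run above; the other line is a guess.
export NCCL_SOCKET_IFNAME=eno4   # NIC carrying the inter-node TCP traffic
export NCCL_IB_DISABLE=1         # skip InfiniBand/RoCE and force the socket transport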
You can try to add NCCL_DEBUG=INFO, which will give further information, and run nccl-tests to make sure NCCL works as expected. It could be a problem when NCCL tries to create the communicator among the nodes.
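For example, something along these lines (a sketch only; it assumes nccl-tests has been built with MPI support inside the same container image, and the path /workspace/nccl-tests is illustrative):
# Hypothetical nccl-tests run across both nodes (4 GPUs per node, 8 ranks total).
# Requires nccl-tests (github.com/NVIDIA/nccl-tests) built with MPI=1 in the container.
export NCCL_DEBUG=INFO
srun -N 2 -n 8 --ntasks-per-node=4 --mpi=pmix \
     --container-image ${IMAGE} \
     --container-mounts /home/edwardhu/my-script/triton:/workspace \
     /workspace/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1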