fastertransformer_backend
Memory usage not going up with model instances
Hi,
I am using this backend for inference with a GPT-J model (a CodeGen checkpoint converted to GPT-J format, to be precise), and I'm trying to load more than one model instance to handle concurrent requests. However, as I increase the number of instances, GPU memory usage doesn't go up. The first instance takes about 6 GB of memory, but each subsequent instance adds only a tiny fraction of that. Was wondering if this is a bug?
Here are the relevant details of the config.pbtxt file:
instance_group [
  {
    count: 3
    kind: KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
Any help would be appreciated!
All instances share the same model weights, so each additional instance only allocates extra workspace for its computations.
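If you want to confirm this behavior, one quick check is to compare GPU memory usage across different instance counts. Below is a minimal sketch using pynvml; the library choice and the use of GPU index 0 are assumptions on my part, not part of the original setup:
import pynvml

# Query current GPU memory usage; run this once per instance-count setting
# to see that each extra instance adds only a small amount of workspace.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the model is on GPU 0
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU 0 used memory: {info.used / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()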