
Support for XGLM: How to achieve faster inference speed?


Describe a requested feature

Thanks for releasing this great library! I am working on deploying facebook/xglm-7.5B, which is not currently supported by parallelformers.

POLICY.md provides a comprehensive guide for parallelizing custom models, but I am still a little unsure about

  1. which weights should be parallelized, and
  2. how many GPUs should be used

to get the best inference speed.
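
For the second question, the back-of-the-envelope memory math I am working from (fp16 weights only, ignoring activations, the KV cache, and framework overhead) looks like this:

```python
# Rough per-GPU weight memory under tensor parallelism.
# Assumes fp16 (2 bytes per parameter); activations and KV cache are not counted.
PARAMS = 7.5e9          # facebook/xglm-7.5B
BYTES_PER_PARAM = 2     # fp16

for num_gpus in (1, 2, 4, 8):
    per_gpu_gb = PARAMS * BYTES_PER_PARAM / num_gpus / 1024**3
    print(f"{num_gpus} GPU(s): ~{per_gpu_gb:.1f} GB of weights per GPU")
```

So the weights alone are roughly 14 GB in total (about 7 GB per GPU on two GPUs), but I am not sure how much extra headroom parallelformers needs on top of that.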

Architecture of XGLM-7.5B

root
├── model (XGLMModel)
│   ├── embed_tokens (Embedding) weight:[256008, 4096]
│   ├── embed_positions (XGLMSinusoidalPositionalEmbedding) weights:[2050, 4096]
│   ├── layers (ModuleList)
│   │   └── 0-31(XGLMDecoderLayer)
│   │       ├── self_attn (XGLMAttention)
│   │       │   └── k_proj,v_proj,q_proj,out_proj(Linear) weight:[4096, 4096] bias:[4096]
│   │       ├── self_attn_layer_norm,final_layer_norm(LayerNorm) weight:[4096] bias:[4096]
│   │       ├── fc1 (Linear) weight:[16384, 4096] bias:[16384]
│   │       └── fc2 (Linear) weight:[4096, 16384] bias:[4096]
│   └── layer_norm (LayerNorm) weight:[4096] bias:[4096]
└── lm_head (Linear) weight:[256008, 4096]
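
To make the first question concrete, here is a rough draft of the policy I have in mind, modeled on the GPT-Neo example in POLICY.md: split q_proj/k_proj/v_proj and fc1 column-wise, split out_proj and fc2 row-wise with an all-reduce, and leave the embeddings, layer norms, and lm_head replicated. The attribute names (self_attn.embed_dim, self_attn.num_heads, config.d_model, config.attention_heads) are just my reading of the HF XGLM implementation, so please correct me if any of them are wrong:

```python
from parallelformers.policies.base import Layer, Policy
from parallelformers.utils.dist_utils import AllReduceLinear
from transformers.models.xglm.modeling_xglm import XGLMDecoderLayer


class XGLMPolicy(Policy):
    @staticmethod
    def replace_arguments(config, world_size):
        return {
            # shrink the per-GPU attention width so each rank only holds its shard
            # (attribute names assumed from the HF XGLMAttention implementation)
            "self_attn.embed_dim": config.d_model // world_size,
            "self_attn.num_heads": config.attention_heads // world_size,
        }

    @staticmethod
    def attn_qkv():
        # column-parallel: q/k/v projections are sliced across GPUs
        return [
            Layer(weight="self_attn.q_proj.weight", bias="self_attn.q_proj.bias"),
            Layer(weight="self_attn.k_proj.weight", bias="self_attn.k_proj.bias"),
            Layer(weight="self_attn.v_proj.weight", bias="self_attn.v_proj.bias"),
        ]

    @staticmethod
    def attn_out():
        # row-parallel: the output projection all-reduces the partial results
        return [
            Layer(
                weight="self_attn.out_proj.weight",
                bias="self_attn.out_proj.bias",
                replace=AllReduceLinear,
            ),
        ]

    @staticmethod
    def mlp_in():
        # column-parallel first MLP projection
        return [Layer(weight="fc1.weight", bias="fc1.bias")]

    @staticmethod
    def mlp_out():
        # row-parallel second MLP projection with all-reduce
        return [Layer(weight="fc2.weight", bias="fc2.bias", replace=AllReduceLinear)]

    @staticmethod
    def original_layer_class():
        return XGLMDecoderLayer
```

If I read POLICY.md correctly, this would then be passed to the parallelize function through the custom_policies argument, e.g. `parallelize(model, num_gpus=2, fp16=True, custom_policies=[XGLMPolicy])`. Does this look like the right set of weights to split, or should the embeddings / lm_head be handled as well?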

un-certainty, Apr 01 '22