Support for XGLM: How to achieve faster inference speed?
Describe a requested feature
Thanks for releasing this great library! I am working on deploying facebook/xglm-7.5B, which is not yet supported by parallelformers.
POLICY.md provides a comprehensive guide for parallelizing my own models, but I am a little unsure about
- which weights should be parallelized, and
- how many GPUs should be used
to achieve better inference speed. My rough plan and a back-of-the-envelope memory estimate are sketched below.
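For reference, this is how I intend to call parallelformers once a policy exists (following the usage shown in the README). The choice of `num_gpus=4` is only my guess from fp16 weight size (7.5B parameters × 2 bytes ≈ 15 GB), not a measured number, so please correct me if the estimate is off.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

# facebook/xglm-7.5B is roughly 15 GB of weights in fp16.
# With num_gpus=4 each GPU would hold about 15 / 4 ≈ 3.75 GB of weights,
# plus activations and the KV cache during generation, so four 16 GB
# cards should (I think) be comfortable; two 24 GB cards might also work.
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-7.5B")
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-7.5B")

# parallelformers moves the shards to the GPUs itself,
# so the model is loaded on CPU and not moved manually.
parallelize(model, num_gpus=4, fp16=True, verbose="detail")

inputs = tokenizer("Parallelformers is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```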
Architecture of XGLM-7.5B
```
root
├── model (XGLMModel)
│   ├── embed_tokens (Embedding) weight: [256008, 4096]
│   ├── embed_positions (XGLMSinusoidalPositionalEmbedding) weights: [2050, 4096]
│   ├── layers (ModuleList)
│   │   └── 0-31 (XGLMDecoderLayer)
│   │       ├── self_attn (XGLMAttention)
│   │       │   └── k_proj, v_proj, q_proj, out_proj (Linear) weight: [4096, 4096] bias: [4096]
│   │       ├── self_attn_layer_norm, final_layer_norm (LayerNorm) weight: [4096] bias: [4096]
│   │       ├── fc1 (Linear) weight: [16384, 4096] bias: [16384]
│   │       └── fc2 (Linear) weight: [4096, 16384] bias: [4096]
│   └── layer_norm (LayerNorm) weight: [4096] bias: [4096]
└── lm_head (Linear) weight: [256008, 4096]
```
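Based on the existing policies described in POLICY.md, my current guess is that q_proj/k_proj/v_proj and fc1 should be column-sliced, out_proj and fc2 should be row-sliced with an all-reduce, and the LayerNorms plus the sinusoidal positional embedding should stay replicated. Below is a rough draft of the policy I put together; the import paths follow my reading of POLICY.md, and the attribute names in `replace_arguments` (`self_attn.embed_dim`, `self_attn.num_heads`, `config.d_model`, `config.attention_heads`) are my reading of modeling_xglm.py, so they may well be wrong, which is partly what I would like to confirm.

```python
from transformers.models.xglm.modeling_xglm import XGLMDecoderLayer

from parallelformers.policies.base import Layer, Policy
from parallelformers.utils.dist_utils import AllReduceLinear


class XGLMPolicy(Policy):
    @staticmethod
    def replace_arguments(config, world_size):
        return {
            # shrink the attention width so each rank only computes its
            # own slice of the heads
            "self_attn.embed_dim": config.d_model // world_size,
            "self_attn.num_heads": config.attention_heads // world_size,
        }

    @staticmethod
    def attn_qkv():
        # column-parallel: split the [4096, 4096] q/k/v projections across GPUs
        return [
            Layer(weight="self_attn.q_proj.weight", bias="self_attn.q_proj.bias"),
            Layer(weight="self_attn.k_proj.weight", bias="self_attn.k_proj.bias"),
            Layer(weight="self_attn.v_proj.weight", bias="self_attn.v_proj.bias"),
        ]

    @staticmethod
    def attn_out():
        # row-parallel: the partial outputs are summed with an all-reduce
        return [
            Layer(
                weight="self_attn.out_proj.weight",
                bias="self_attn.out_proj.bias",
                replace=AllReduceLinear,
            ),
        ]

    @staticmethod
    def mlp_in():
        # fc1 [16384, 4096] is column-parallel
        return [Layer(weight="fc1.weight", bias="fc1.bias")]

    @staticmethod
    def mlp_out():
        # fc2 [4096, 16384] is row-parallel with an all-reduce
        return [Layer(weight="fc2.weight", bias="fc2.bias", replace=AllReduceLinear)]

    @staticmethod
    def original_layer_class():
        return XGLMDecoderLayer
```

One thing I could not work out from POLICY.md is how embed_tokens and lm_head (each 256008 × 4096, about 1 GB apiece in fp16) are handled: do they stay replicated on every GPU, or is there a way to shard them as well? If they are replicated, does that change the recommended number of GPUs?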