pytorch-gpt-x
GPT-X
Implementation of an autoregressive language model (like GPT) using an improved Transformer and DeepSpeed pipeline parallelism.
Improved Transformer
The Transformer used in this repository attempts to improve on the vanilla Transformer with the additional modules below.
Name | Description | Link |
---|---|---|
ReZero | ReZero is All You Need | link |
Explicit Sparse Transformer | Concentrated Attention Through Explicit Selection | link |
Macaron Architecture | Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View | link |
RealFormer | Residual Attention | link |
ALiBi Position Embedding | Train Short, Test Long: Attention with Linear Biases (effective relative positional encoding) | |
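To make the combination concrete, here is a minimal sketch (not the repository's exact code; class names, defaults, and the single shared alpha per block are illustrative) of a decoder block that pairs ReZero residuals with explicit sparse top-k attention, with a comment marking where an ALiBi bias would enter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTopkSelfAttention(nn.Module):
    """Causal self-attention that keeps only the top-k scores per query
    (Explicit Sparse Transformer) before the softmax."""
    def __init__(self, d_model, n_heads, topk=8):
        super().__init__()
        self.n_heads, self.topk = n_heads, topk
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, d = x.shape
        h = self.n_heads
        q, k, v = (z.view(b, t, h, d // h).transpose(1, 2)
                   for z in self.qkv(x).chunk(3, dim=-1))
        scores = q @ k.transpose(-2, -1) / (d // h) ** 0.5
        # ALiBi would add a per-head linear distance bias to `scores` here.
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        # Explicit sparse attention: drop everything below the k-th largest score.
        kth = scores.topk(min(self.topk, t), dim=-1).values[..., -1:]
        scores = scores.masked_fill(scores < kth, float("-inf"))
        y = F.softmax(scores, dim=-1) @ v
        return self.out(y.transpose(1, 2).reshape(b, t, d))

class ReZeroDecoderBlock(nn.Module):
    """ReZero residual x + alpha * f(x): alpha starts at 0, so the block
    begins as the identity and trains stably without LayerNorm."""
    def __init__(self, d_model, n_heads, topk=8):
        super().__init__()
        self.attn = SparseTopkSelfAttention(d_model, n_heads, topk)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        x = x + self.alpha * self.attn(x)
        return x + self.alpha * self.ff(x)
```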
Model Description
model_name | n_params | n_layer | d_model | n_heads | vocab_size | max_seq_len | learning_rate |
---|---|---|---|---|---|---|---|
GPT-X 1B | 1B | 20 | 2048 | 16 | 22000 | 1024 | 2.0 x 10^-4 |
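As a quick sanity check, the usual 12 * n_layer * d_model^2 approximation for the block parameters, plus input and output embeddings, lands close to the TOTAL_PARAMS reported in the pipeline log below (an estimate, not the repository's exact accounting):

```python
n_layer, d_model, vocab_size = 20, 2048, 22000
blocks = 12 * n_layer * d_model ** 2   # attention + feed-forward weights, ~1.007B
embed = vocab_size * d_model           # input embedding, ~45M
head = vocab_size * d_model            # untied output projection, ~45M
print(f"{(blocks + embed + head) / 1e9:.3f}B")  # ~1.097B vs TOTAL_PARAMS=1099214888
```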
DeepSpeed
DeepSpeed is a deep learning training optimization library that provides the means to train massive billion-parameter models at scale.
Pipeline Parallelism
You can train the 1B GPT-X model with DeepSpeed pipeline parallelism on two V100 GPUs (16GB). A minimal sketch of that setup follows; the GPU usage and partition logs below show the result.
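In this sketch the model is expressed as a flat layer list so DeepSpeed's PipelineModule can split it into two stages balanced by parameter count. The layer sequence mirrors the partition log below; the `ReZeroDecoderBlock` from the earlier sketch stands in for the repo's `ReZeroSparseTopkDecoder`, and the config path, step count, and data iterator are assumptions:

```python
import deepspeed
import torch.nn as nn
import torch.nn.functional as F
from deepspeed.pipe import PipelineModule

# Flat layer list: Embedding -> 20 decoder blocks -> LayerNorm -> Linear,
# mirroring the stage partition shown in the log below.
layers = ([nn.Embedding(22000, 2048)]  # stands in for the repo's Embedding layer
          + [ReZeroDecoderBlock(d_model=2048, n_heads=16) for _ in range(20)]
          + [nn.LayerNorm(2048), nn.Linear(2048, 22000)])

model = PipelineModule(
    layers=layers,
    num_stages=2,                   # one stage per V100
    partition_method="parameters",  # balance stages by parameter count
    loss_fn=F.cross_entropy,        # matches "loss: cross_entropy" in the log
)

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config="ds_config.json")

for _ in range(train_steps):                         # train_steps: assumed
    loss = engine.train_batch(data_iter=train_iter)  # runs all micro-batches
```

Launched with DeepSpeed's launcher (e.g. `deepspeed --num_gpus=2 train.py`), each rank builds one stage, and every `train_batch()` call drives the micro-batches through the pipeline for one optimizer step.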
GPU Usage
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:00:06.0 Off | 0 |
| N/A 42C P0 44W / 250W | 16076MiB / 16130MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:00:07.0 Off | 0 |
| N/A 45C P0 168W / 250W | 16060MiB / 16130MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 29525 C /home/ubuntu/anaconda3/bin/python 16065MiB |
| 1 29528 C /home/ubuntu/anaconda3/bin/python 16049MiB |
+-----------------------------------------------------------------------------+
Pipeline Parallelism Log
[2021-12-31 12:24:20,042] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=4 micro_batch_size=1
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=1 STAGE=1 LAYERS=12 [11, 23) STAGE_PARAMS=548560916 (548.561M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=11 [0, 11) STAGE_PARAMS=550653972 (550.654M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:08,793] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=11
0: Embedding
1: ReZeroSparseTopkDecoder
2: ReZeroSparseTopkDecoder
3: ReZeroSparseTopkDecoder
4: ReZeroSparseTopkDecoder
5: ReZeroSparseTopkDecoder
6: ReZeroSparseTopkDecoder
7: ReZeroSparseTopkDecoder
8: ReZeroSparseTopkDecoder
9: ReZeroSparseTopkDecoder
10: ReZeroSparseTopkDecoder
stage=1 layers=12
11: ReZeroSparseTopkDecoder
12: ReZeroSparseTopkDecoder
13: ReZeroSparseTopkDecoder
14: ReZeroSparseTopkDecoder
15: ReZeroSparseTopkDecoder
16: ReZeroSparseTopkDecoder
17: ReZeroSparseTopkDecoder
18: ReZeroSparseTopkDecoder
19: ReZeroSparseTopkDecoder
20: ReZeroSparseTopkDecoder
21: LayerNorm
22: Linear
loss: cross_entropy
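The CONFIG line above (micro_batches=4, micro_batch_size=1) implies train_batch_size = 4 x 1 x (data-parallel degree 1) = 4. A matching DeepSpeed config sketch (the optimizer and fp16 settings are illustrative assumptions):

```python
ds_config = {
    "train_batch_size": 4,                 # micro_batches * micro_batch_size * dp
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 2.0e-4}},
    "fp16": {"enabled": True},
}
# A dict like this can be passed directly: deepspeed.initialize(..., config=ds_config)
```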
TODO
- [x] ~~ReZero~~
- [x] ~~RealFormer, Residual Attention~~
- [x] ~~Macaron architectures~~
- [x] ~~Macaron architectures - layer Scale 0.5~~
- [x] ~~Explicit Sparse Transformer~~
- [x] ~~torch lightning~~
- [x] ~~Deepspeed train on single GPU~~
- [x] apply wandb
- [x] DeepSpeed pipeline parallel training on 2 V100 GPUs with 16GB memory
Parameter For Few-shot
GPT-3 has 175B parameters, and model size matters for few-shot learning. In this repository, I try to pretrain a language model as large as possible using two V100 GPUs.
GPT-3 Config
model_name | n_params | n_layer | d_model | n_heads | d_head | batch_size (tokens) | learning_rate |
---|---|---|---|---|---|---|---|
GPT-3 175B | 175B | 96 | 12288 | 96 | 128 | 3.2M | 0.6 x 10^-4 |
GPT-3 13B | 13B | 40 | 5140 | 40 | 128 | 2M | 1.0 x 10^-4 |
GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M | 1.2 x 10^-4 |
GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M | 1.6 x 10^-4 |
GPT-3 1.3B | 1.3B | 24 | 2048 | 24 | 128 | 1M | 2.0 x 10^-4 |
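The same 12 * n_layer * d_model^2 approximation used earlier reproduces the sizes in this table to within a few percent (embeddings ignored):

```python
configs = {"175B": (96, 12288), "13B": (40, 5140), "6.7B": (32, 4096),
           "2.7B": (32, 2560), "1.3B": (24, 2048)}
for name, (n_layer, d_model) in configs.items():
    print(name, f"-> {12 * n_layer * d_model ** 2 / 1e9:.1f}B")
# 175B -> 173.9B, 13B -> 12.7B, 6.7B -> 6.4B, 2.7B -> 2.5B, 1.3B -> 1.2B
```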
Issues

- AttributeError: module 'deepspeed' has no attribute 'zero': reinstall DeepSpeed.
- UserWarning: CUDA initialization: The NVIDIA driver on your system is too old: reinstall PyTorch for your CUDA version. My solution (V100 GPU, CUDA 10.1):
  pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
- Can't find CUDA_HOME path: reinstall CUDA.
References
Transformer
DeepSpeed
ReZero
Explicit Sparse Transformer
Macaron Architecture
RealFormer Residual Attention
DeepSpeed Pipeline Parallelism