Unable to run 8K context length even with multiple GPUs.
Hello, I am using Qlora.py to train the StarCoder model with its full 8K context length, but this does not work on a single GPU; I am using a 40GB A100 machine.
I therefore tried to use multiple GPUs by setting CUDA_VISIBLE_DEVICES="0,1,2,3,4,5",
but at most one GPU is actually being used.
So I tried a manual device_map for StarCoder, as follows:
device_map = {
    'transformer.wte': 0,
    'transformer.wpe': 0,
    'transformer.drop': 0,
    'transformer.h.0': 0,
    'transformer.h.1': 0,
    'transformer.h.2': 1,
    'transformer.h.3': 1,
    'transformer.h.4': 1,
    'transformer.h.5': 1,
    'transformer.h.6': 1,
    'transformer.h.7': 1,
    'transformer.h.8': 1,
    'transformer.h.9': 1,
    'transformer.h.10': 2,
    'transformer.h.11': 2,
    'transformer.h.12': 2,
    'transformer.h.13': 2,
    'transformer.h.14': 2,
    'transformer.h.15': 2,
    'transformer.h.16': 2,
    'transformer.h.17': 3,
    'transformer.h.18': 3,
    'transformer.h.19': 3,
    'transformer.h.20': 3,
    'transformer.h.21': 3,
    'transformer.h.22': 3,
    'transformer.h.23': 3,
    'transformer.h.24': 3,
    'transformer.h.25': 4,
    'transformer.h.26': 4,
    'transformer.h.27': 4,
    'transformer.h.28': 4,
    'transformer.h.29': 4,
    'transformer.h.30': 4,
    'transformer.h.31': 4,
    'transformer.h.32': 4,
    'transformer.h.33': 5,
    'transformer.h.34': 5,
    'transformer.h.35': 5,
    'transformer.h.36': 5,
    'transformer.h.37': 5,
    'transformer.h.38': 5,
    'transformer.h.39': 5,
    'transformer.ln_f': 5,
    'lm_head': 0,
}
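For reference, a minimal sketch of passing such a map to from_pretrained together with 4-bit quantization (assuming transformers, accelerate, and bitsandbytes are installed; device_map="auto" would produce a balanced split without hand-writing the dict):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Standard 4-bit (QLoRA-style) quantization settings.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map can be the hand-written dict above, or "auto" to let
# accelerate balance the layers across all visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=bnb_config,
    device_map=device_map,
)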
Now I believe the model is being loaded across multiple GPUs, yet I was only able to train with a 6K context length, even with 4-bit quantization.
The main question: if I can train with a 2K context on a single GPU, why can I not run an 8K context even after increasing the GPU count 6x?
Is the device_map working correctly? If yes, what am I missing? Please advise.
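You can check whether the map is actually in effect by printing the resolved placement; accelerate records it on the model when it is loaded with a device_map (a quick sketch):

import torch

# hf_device_map is populated by accelerate at load time and shows
# which device each module was assigned to.
print(model.hf_device_map)

# Cross-check live per-GPU memory usage:
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1e9:.1f} GB allocated")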
As for why adding GPUs does not unlock 8K: with standard attention, the memory requirement grows quadratically as the sequence length gets longer, and a layer-wise device_map does not reduce that cost. Splitting layers across GPUs shards the weights, but each layer's attention is still computed over the full sequence on a single device. Using flash attention, which avoids materializing the full attention matrix, can help with this.
My repo here uses QLoRA with flash attention for Llama models: https://github.com/mallorbc/Finetune_LLMs
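In recent transformers releases, flash attention can also be requested directly at load time (a sketch only, assuming the flash-attn package is installed and a supported GPU; older releases used a different flag):

import torch
from transformers import AutoModelForCausalLM

# Sketch: asks transformers to use the FlashAttention-2 kernels.
# Requires the flash-attn package and an Ampere-or-newer GPU.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)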