ColossalAI issues

[BUG]: OOM during llama2 pretraining with flashattention and PP

3

### 🐛 Describe the bug I understand that this error comes from the flash attention software stack, but there seems to be no related issue except for https://github.com/Dao-AILab/flash-attention/issues/590, therefore I...

insujang

bug

[FEATURE]: Upgrade the transformers version from 4.33.0 to 4.36.0 for Shardformer.

1

### Describe the feature Shardformer was originally developed based on transformers==4.33.0. In response to our users' needs, it needs to be upgraded to version 4.36.0. The main changes involve the...

wangbluo

enhancement

[FEATURE]: pretrain data example

### Describe the feature Can somebody give an example of the pretrain data format?

alphanlp

enhancement

Allow building cuda extension without a device.

## 🚨 Issue number fixes #5534 ## 📝 What does this PR do? Added `FORCE_CUDA` environment variable support, to enable building extensions where a GPU device is not present but...

ccoulombe

[BUG]: Cannot build extensions when no gpu device exists

### 🐛 Describe the bug When no GPU device exists, such as CI or build nodes, no extensions can be built since `torch.cuda.is_available` checks for a device and not if...

ccoulombe

bug
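The behaviour described in the issue and its fix PR above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `should_build_cuda_ext`, its parameters, and the assumption that `FORCE_CUDA=1` is the enabling value are all hypothetical. Only the `FORCE_CUDA` variable name and the `torch.cuda.is_available` check come from the issue text.

```python
import os


def should_build_cuda_ext(device_available, env=None):
    """Hypothetical helper: decide whether CUDA extensions should be built.

    device_available stands in for the result of torch.cuda.is_available(),
    which is False on CI or build nodes that have no GPU.
    """
    env = os.environ if env is None else env
    # Assumption: the value "1" turns the override on.
    forced = env.get("FORCE_CUDA", "0") == "1"
    # Build when a device is visible, or when the user explicitly forces it.
    return forced or device_available
```

With this sketch, a build node without a GPU would still build the extensions when `FORCE_CUDA=1` is exported, while the default behaviour stays unchanged.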

[BUG]: Coati Lora incompatible with Gemini & HybridParallel(pp=1), but runs well with HybridParallel(tp>=2)

1

### 🐛 Describe the bug ## Description I implemented `Coati Lora` before parallel fine-tuning for LLaMA-7B, and found: - `Gemini` runs into _Error(s) in loading state_dict for GeminiCheckpointIO:_ and Train...

Fallqs

bug

[chat] use chunked MDP

## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...

cwher

[FEATURE]: support dit in Shardformer

### Describe the feature The DiT model is the base model behind Sora. Could layer parallelism for it be supported in ColossalAI?

likelyzhao

enhancement

Fixes bug by raising exception on size mismatch

## 📌 Checklist before creating the PR - [x] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A...

KimbingNg

[BUG]: Size mismatch is ignored when loading checkpoint

### 🐛 Describe the bug In [this code block](https://github.com/hpcaitech/ColossalAI/blob/6df844b8c4946c734115b7e180b292888d857bc1/colossalai/checkpoint_io/utils.py#L560), when a size mismatch occurs, no error message is printed. Fix: a `RuntimeError` should be raised when `len(error_msgs) > 0` ### Environment _No...

KimbingNg

bug
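The fix proposed in the issue above amounts to a check like the one below. This is a minimal sketch of the idea only: the function name `raise_on_load_errors`, the `model_name` parameter, and the message format are hypothetical and not taken from `colossalai/checkpoint_io/utils.py`. The one thing taken from the issue is the condition `len(error_msgs) > 0`.

```python
def raise_on_load_errors(error_msgs, model_name="Model"):
    """Hypothetical helper: fail loudly if checkpoint loading collected errors.

    error_msgs is assumed to be a list of strings (for example size-mismatch
    messages) gathered while loading a state dict.
    """
    if len(error_msgs) > 0:
        joined = "\n\t".join(error_msgs)
        # Surface every collected message instead of silently ignoring them.
        raise RuntimeError(
            f"Error(s) in loading state_dict for {model_name}:\n\t{joined}"
        )
```

Calling it with an empty list is a no-op, so it can sit at the end of a loading routine without affecting the success path.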
