Insu Jang
…distribution ## 📌 Checklist before creating the PR - [x] I have created an issue for this PR for traceability - [x] The title follows the standard format: `[doc/gemini/tensor/...]: A...
### 🐛 Describe the bug Hi, I am trying to implement a custom shard policy with a different layer distribution, but it seems all built-in policies have the following inconsistent implementation:...
### 🐛 Describe the bug **Using `LazyInitContext` and later loading a checkpoint does not properly initialize model parameters.** ```python import colossalai from colossalai.lazy import LazyInitContext from colossalai.booster import Booster from colossalai.booster.plugin...
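A minimal sketch of the lazy-initialization pattern this issue exercises, in plain Python (no ColossalAI or torch dependency; `LazyParam`, `materialize`, and `load` are hypothetical names, not ColossalAI's API). A lazily initialized parameter records only metadata at construction time; the invariant the bug report says is violated is that a checkpoint load must overwrite whatever values materialization produced.

```python
class LazyParam:
    """Hypothetical lazy parameter: shape is recorded eagerly, storage is not."""

    def __init__(self, shape):
        self.shape = shape   # metadata only; no buffer allocated yet
        self.data = None     # filled in by materialize() or load()

    def materialize(self):
        # Default initialization (here: zeros) happens only on demand.
        if self.data is None:
            self.data = [0.0] * self.shape
        return self.data

    def load(self, values):
        # A checkpoint load must replace the data, regardless of whether
        # the parameter was already materialized with default values.
        assert len(values) == self.shape
        self.data = list(values)


p = LazyParam(shape=3)
p.materialize()           # default init: [0.0, 0.0, 0.0]
p.load([1.0, 2.0, 3.0])   # checkpoint values must win over the defaults
```

The reported bug is, in these terms, a case where `p.data` keeps its default-initialized values after the checkpoint load instead of the loaded ones.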
### 🐛 Describe the bug 1. It seems blip2 testing doesn't work correctly at all if the model is half precision (torch.float16). 2. With bfloat16, `colossalai.shardformer.layer.FusedLayerNorm` doesn't seem to work correctly....
### 🐛 Describe the bug When using tensor parallelism, model parameters are sharded across GPUs to reduce memory consumption and enable parallel execution. However, the optimizer still holds unsharded model...
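A back-of-the-envelope sketch of the memory imbalance described above, assuming fp16 parameters and gradients and Adam's two fp32 states per parameter. The function and its accounting are illustrative assumptions, not ColossalAI's actual bookkeeping.

```python
def per_gpu_bytes(num_params, tp_degree, shard_optimizer):
    """Rough per-GPU memory for a tensor-parallel model (illustrative only)."""
    param_bytes = 2 * num_params // tp_degree   # fp16 parameter shard
    grad_bytes = 2 * num_params // tp_degree    # fp16 gradient shard
    opt_states = 2 * 4 * num_params             # Adam m and v, each fp32
    if shard_optimizer:
        opt_states //= tp_degree
    return param_bytes + grad_bytes + opt_states


n = 1_000_000_000  # a 1B-parameter model
unsharded = per_gpu_bytes(n, tp_degree=4, shard_optimizer=False)  # 9.0 GB
sharded = per_gpu_bytes(n, tp_degree=4, shard_optimizer=True)     # 3.0 GB
```

With 4-way tensor parallelism, the unsharded optimizer dominates: the 8 GB of Adam states dwarf the 1 GB of sharded parameters and gradients, which is exactly the inconsistency the issue points at.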
### 🐛 Describe the bug I understand that this error came out of the flash attention software stack, but it seems there is no related issue except for https://github.com/Dao-AILab/flash-attention/issues/590, therefore I...
While handling failures, if some pipeline doesn't have enough nodes, Oobleck is supposed to borrow nodes from other pipelines or merge pipelines. The previous implementation was a prototype,...
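The borrow-or-merge policy described above can be sketched as follows. This is a hypothetical, dependency-free illustration of the stated behavior (pipelines as node counts, `min_nodes` as the viability threshold), not Oobleck's actual implementation.

```python
def rebalance(pipelines, min_nodes):
    """Bring every pipeline up to min_nodes: borrow spare nodes first,
    merge the deficient pipeline into another one as a last resort."""
    pipelines = list(pipelines)
    for i in range(len(pipelines)):
        while 0 < pipelines[i] < min_nodes:
            # Prefer borrowing from a pipeline with surplus nodes;
            # donors never drop below min_nodes themselves.
            donor = next((j for j, m in enumerate(pipelines)
                          if j != i and m > min_nodes), None)
            if donor is not None:
                pipelines[donor] -= 1
                pipelines[i] += 1
            else:
                # No donor available: merge this pipeline's nodes into
                # the smallest surviving pipeline.
                others = [j for j in range(len(pipelines)) if j != i]
                if not others:
                    break
                target = min(others, key=lambda j: pipelines[j])
                pipelines[target] += pipelines[i]
                pipelines[i] = 0
                break
    return [m for m in pipelines if m > 0]


rebalance([4, 2, 1], min_nodes=2)  # borrow: -> [3, 2, 2]
rebalance([2, 1], min_nodes=2)     # no donor, merge: -> [3]
```

Borrowing is preferred because it preserves the number of pipelines (and thus data-parallel degree); merging shrinks it but keeps every surviving pipeline viable.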