ColossalAI issues

[BUG]: Op hook leads to memory leak

## Describe the problem In version 0.1.7, I found that the Op hook leads to a memory leak. If you use the hook on an `nn.Module`, even if it is a dummy hook, more...

ver217

bug

known issue

[BUG]: Relationship between `BATCH_SIZE`, `LEARNING_RATE` and GPU amount

2

### 🐛 Describe the bug When testing [DeTr on Colossal-Example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/detr), I encountered an issue with a model using only DDP in the following situations: 1. `LEARNING_RATE=1e-4`, `world_size=4` 2. `LEARNING_RATE=2e-4`, `world_size=8` 3. `LEARNING_RATE=1e-4`, `world_size=8`...

BoxiangW

bug

[BUG]: ZeRO causes runtime error when use GRU and pack sequence

4

### 🐛 Describe the bug I run the following script and get `RuntimeError: Function CudnnRnnBackward0 returned an invalid gradient at index 1 - got [0] but expected shape compatible with...

yuxinyuan

bug

[BUG]: Zero returns fp16 tensors which causes RuntimeError

3

### 🐛 Describe the bug I run the following script and it reports `Found dtype Float but expected Half`. It turns out that `y_hat` is of type fp16, but `y`...

yuxinyuan

bug

[BUG]: ZeRO not Working with SGD Optimizer

3

### 🐛 Describe the bug ZeRO will keep throwing overflow if used together with momentum SGD in the [resnet example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet). The code works fine with all kinds of amp. ###...

FrankLeeeee

bug

[DOC]: Colab Tutorial Request

### 📚 The doc issue Hi Colossal-AI developers, Thank you for your amazing work! Would you consider creating a Colab tutorial page? I think it can allow users to experiment...

yuxuan-lou

documentation

[FEATURE]: example - Diffusion Model

1

### Describe the feature e.g. Latent Diffusion

binmakeswell

enhancement

[FEATURE]: example - stylegan_xl

4

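### Describe the feature StyleGAN-XL is a general training architecture that supports StyleGAN3 and StyleGAN2-ADA, and its code has also been simplified. It would make a great example; the code was released only a few weeks ago.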

binmakeswell

enhancement

Questions about log interpretation, seems paradoxical

6

This log line confuses me: my batch size is 513 and the iteration time is 98.83, so the throughput should be 5.19. Obviously, the logs of iteration time and throughput...

shjwudp
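The arithmetic behind the figure quoted in this issue can be checked with a short sketch. It assumes throughput is simply batch size divided by iteration time; the issue snippet does not state the units or how the logger actually computes the value, and `expected_throughput` is a hypothetical helper, not a ColossalAI function.

```python
def expected_throughput(batch_size: int, iteration_time: float) -> float:
    """Return batch_size / iteration_time (assumed definition of throughput)."""
    if iteration_time <= 0:
        raise ValueError("iteration_time must be positive")
    return batch_size / iteration_time


# Values quoted in the issue: batch size 513, iteration time 98.83.
print(round(expected_throughput(513, 98.83), 2))  # 5.19
```

If the logged throughput differs from this value, the logger is likely using a different definition, such as a different batch size or averaging window.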

[FEATURE]: set ComputePattern for op rather than parameter

### Describe the feature We currently set the spec on each parameter, which means each parameter has its own fixed compute_pattern. However, some models, like GPT-2, share parameters among different layers. GPT-2...

ver217

enhancement
