ColossalAI issues

[BUG]: ColossalAI cannot split tensor evenly when using sequence parallelism in the hybrid plugin

2

### Is there an existing issue for this bug? - [x] I have searched the existing issues ### The bug has not been fixed in the latest main branch -...

Hugo-cell111

bug

When will ColossalAI support 2D, 2.5D, 3D tensor parallelism?

I've noticed that the latest version of ColossalAI does not support 2D, 2.5D, and 3D tensor parallelism. I would like to know, according to ColossalAI's roadmap, when Shardformer will support...

AmazDeng

Flux LoRA training support

4

@FrankLeeeee @gothicx @tiansiyuan @jeffra Does ColossalAI support training the Flux model? For example, if I'm using a LoRA paradigm and need to redefine the processor within Flux, is this training method...

AmazDeng

How to set different learning rates for different parameters with Hybrid Adam?

2

crepuscularlight

fix: wrong dp-rank condition when pp is enabled

## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...

liuqh16

[FEATURE]: Add more training models and RLHF algorithms

1

### Describe the feature Add more training models and RLHF algorithms for the branch `grpo-latest`.

sglucas

enhancement

[BUG]: Running example/language/llama7B on an NPU card fails when calling torch.npu.current_device()

1

### Is there an existing issue for this bug? - [x] I have searched the existing issues ### The bug has not been fixed in the latest main branch -...

upwindfly

bug

[FEATURE]: add master_weights arg to HybridParallelPlugin

1

### Describe the feature When using CPU offload, setting master_weights=False in both GeminiPlugin and LowLevelZeroPlugin can reduce GPU memory usage and improve speed. Does HybridParallelPlugin also support this feature?

eiPI1-0

enhancement

fix: resolve multi-node training hanging in Kubernetes environments

## Description Addresses issue #6349 where multi-node training gets stuck during distributed initialization when using torchrun in Kubernetes. ## Root Cause - Missing rendezvous backend configuration in torchrun - No...

amyanger

fix a broken link

## 📝 What does this PR do? fix a broken link - I hope this is the right new location

stas00
