ColossalAI issues

[BUG]: weird stuck while training

8

### Is there an existing issue for this bug? - [X] I have searched the existing issues ### 🐛 Describe the bug When training a language model with the GeminiPlugin,...

airlsyn

bug

[BUG]: Got nan during backward with zero2

9

### Is there an existing issue for this bug? - [X] I have searched the existing issues ### 🐛 Describe the bug My code is based on Open-Sora, and can...

flymin

bug

[feat] support zbv in mixtral benchmark;

## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...

duanjunwen

[NPU]support npu

## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...

flybird11111

[pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/psf/black-pre-commit-mirror: 24.8.0 → 24.10.0](https://github.com/psf/black-pre-commit-mirror/compare/24.8.0...24.10.0)
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.1](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.1)
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.6.0...v5.0.0)

pre-commit-ci[bot]

[FEATURE]: Is it Possible to integrate Liger-Kernel?

8

### Describe the feature https://github.com/linkedin/Liger-Kernel Liger Kernel is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory...

airlsyn

enhancement

[zero bubble]support zbv all

## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...

flybird11111

FasterMoE shadow expert implement

1

Hello, I want to implement the FasterMoE shadow expert based on ColossalAI-MoeHybridParallel. Is it possible? How can I achieve it?

Guodanding

[BUG]: Unable to train on H20 machine

1

### Is there an existing issue for this bug? - [X] I have searched the existing issues ### 🐛 Describe the bug I want to use an NVIDIA H20 machine to...

kaixinbear

bug

[doc] sequence parallel document

## 📝 What does this PR do? Adds a supplementary comparison of the principles of sequence parallelism, including ring-attention and Ulysses, and an explanation of their use cases.

wangbluo
