[BUG]: colossalai check -i error

Open Alternate-D opened this issue 2 years ago • 10 comments

🐛 Describe the bug

I installed colossalai 0.2.5 successfully (from source, not PyPI), but the following error occurred when I ran "colossalai check -i". Please help me.

(Colossal-AI) ln01@ln01-System-Product-Name:/media/ln01/2t/usr/wy$ colossalai check -i
Traceback (most recent call last):
  File "/home/ln01/anaconda3/envs/Colossal-AI/bin/colossalai", line 5, in <module>
    from colossalai.cli import cli
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/__init__.py", line 1, in <module>
    from .initialize import (
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/initialize.py", line 18, in <module>
    from colossalai.amp import AMP_TYPE, convert_to_amp
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/amp/__init__.py", line 9, in <module>
    from .torch_amp import convert_to_torch_amp
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/amp/torch_amp/__init__.py", line 9, in <module>
    from .torch_amp import TorchAMPLoss, TorchAMPModel, TorchAMPOptimizer
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/amp/torch_amp/torch_amp.py", line 10, in <module>
    from colossalai.nn.optimizer import ColossalaiOptimizer
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/__init__.py", line 1, in <module>
    from ._ops import *
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/_ops/__init__.py", line 1, in <module>
    from .addmm import colo_addmm
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/_ops/addmm.py", line 5, in <module>
    from ._utils import GeneralTensor, Number, convert_to_colo_tensor
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/_ops/_utils.py", line 8, in <module>
    from colossalai.nn.layer.utils import divide
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/layer/__init__.py", line 1, in <module>
    from .colossalai_layer import *
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/layer/colossalai_layer/__init__.py", line 2, in <module>
    from .dropout import Dropout
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/layer/colossalai_layer/dropout.py", line 4, in <module>
    from ..parallel_1d import *
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/layer/parallel_1d/__init__.py", line 1, in <module>
    from .layers import (Classifier1D, Dropout1D, Embedding1D, LayerNorm1D, Linear1D, Linear1D_Col, Linear1D_Row,
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/nn/layer/parallel_1d/layers.py", line 17, in <module>
    from colossalai.kernel import LayerNorm
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/kernel/__init__.py", line 1, in <module>
    from .cuda_native import FusedScaleMaskSoftmax, LayerNorm, MultiHeadAttention
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/kernel/cuda_native/__init__.py", line 1, in <module>
    from .layer_norm import MixedFusedLayerNorm as LayerNorm
  File "/home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 12, in <module>
    from colossalai.kernel.op_builder.layernorm import LayerNormBuilder
ModuleNotFoundError: No module named 'colossalai.kernel.op_builder'

Environment

Python 3.7 + CUDA 11.7 + torch 1.13.1

Alternate-D avatar Feb 21 '23 08:02 Alternate-D

Hi, we have received several user reports about this; one related issue is #2811.

FrankLeeeee avatar Feb 21 '23 08:02 FrankLeeeee

I am looking into this issue, but I cannot reproduce the bug on my machine. Could you provide a Dockerfile that reproduces it in Docker? I would be more than happy to help if needed.

FrankLeeeee avatar Feb 21 '23 08:02 FrankLeeeee

> I am looking into this issue, but I cannot reproduce the bug on my machine. Could you provide a Dockerfile that reproduces it in Docker? I would be more than happy to help if needed.

Sorry, I'm not familiar with the usage of Docker.

Alternate-D avatar Feb 21 '23 09:02 Alternate-D

> Hi, we have received several user reports about this; one related issue is #2811.

Thanks, I'll refer to this.

Alternate-D avatar Feb 21 '23 09:02 Alternate-D

That's absolutely alright. If you could provide the output of the following bash commands, perhaps it can help me better locate the bug.

ls /home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/kernel
ls /home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/
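
For readers who want more detail than a plain `ls` gives, the same inspection can be sketched in Python. This is only an illustrative helper, not part of Colossal-AI: the function names `describe_entry` and `inspect_package_dir` are made up here, and the only thing taken from the thread is that the entry of interest is `op_builder` inside `colossalai/kernel`. The sketch distinguishes a missing entry from a dangling symlink, which `ls` output alone can make easy to overlook.

```python
import os
import tempfile


def describe_entry(path):
    """Classify a filesystem entry: symlink, broken symlink, directory, file, or missing."""
    if os.path.islink(path):
        target = os.readlink(path)
        # os.path.exists follows the link, so False here means a dangling link.
        if not os.path.exists(path):
            return f"broken symlink -> {target}"
        return f"symlink -> {target}"
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    return "missing"


def inspect_package_dir(package_dir, names=("op_builder",)):
    """Return {name: description} for selected entries inside package_dir."""
    return {name: describe_entry(os.path.join(package_dir, name)) for name in names}


if __name__ == "__main__":
    # Hypothetical stand-in for site-packages/colossalai/kernel.
    with tempfile.TemporaryDirectory() as root:
        kernel_dir = os.path.join(root, "colossalai", "kernel")
        os.makedirs(kernel_dir)
        print(inspect_package_dir(kernel_dir))
```

To check a real installation, pass the `colossalai/kernel` directory from the traceback as `package_dir`.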

FrankLeeeee avatar Feb 21 '23 09:02 FrankLeeeee

------------------ Original Message ------------------ From: "Frank @.>; Sent: Tuesday, February 21, 2023, 5:15 PM To: @.>; Cc: @.>; @.>; Subject: Re: [hpcaitech/ColossalAI] [BUG]: colossalai check -i error (Issue #2845)

> That's absolutely alright. If you could provide the output of the following bash commands, perhaps it can help me better locate the bug.
>
> ls /home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/kernel
> ls /home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/

Alternate-D avatar Feb 21 '23 09:02 Alternate-D

@Alternate-D may I ask whether you are running on Windows?

FrankLeeeee avatar Feb 22 '23 09:02 FrankLeeeee

> @Alternate-D may I ask whether you are running on Windows?

Linux, Ubuntu 22.04

Alternate-D avatar Feb 23 '23 07:02 Alternate-D

@FrankLeeeee

> That's absolutely alright. If you could provide the output of the following bash commands, perhaps it can help me better locate the bug.
>
> ls /home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/colossalai/kernel
> ls /home/ln01/anaconda3/envs/Colossal-AI/lib/python3.7/site-packages/

I solved this problem by adding a symlink:

site-packages/colossalai/kernel$ ln -s ../../op_builder op_builder

After that, the check command works:

$ colossalai check -i
#### Installation Report ####

------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.0
System CUDA version: 10.2
CUDA version required by PyTorch: 10.2

The missing link seems to be the immediate cause of this error, but the root cause lies somewhere in the installation process. My environment is:

$ conda --version
conda 4.9.2
$ python -V
Python 3.9.12

The project version is:

$ git log
commit cd2b0eaa8dd4a7d8a67ce91b93459e07418bd741 (origin/main, origin/HEAD)
Author: YuliangLiu0306 <[email protected]>
Date:   Tue Mar 7 11:08:11 2023 +0800

    [DTensor] refactor sharding spec (#2987)
    
    * [autoparallel] refactor sharding spec
    
    * rename function name
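
The `ln -s ../../op_builder op_builder` workaround from earlier in this comment can be sketched in Python as follows. This is a minimal sketch, not an official fix: `link_op_builder` is a hypothetical name, and it assumes a top-level `op_builder` directory sits in `site-packages` next to `colossalai`, which is what the relative path in the `ln` command implies.

```python
import os


def link_op_builder(site_packages):
    """Recreate `ln -s ../../op_builder op_builder` inside colossalai/kernel.

    Returns True if a link was created, False if a usable entry already existed.
    """
    kernel_dir = os.path.join(site_packages, "colossalai", "kernel")
    link_path = os.path.join(kernel_dir, "op_builder")
    # Assumption: the real package lives at site-packages/op_builder.
    target = os.path.join(site_packages, "op_builder")

    if not os.path.isdir(target):
        raise FileNotFoundError(f"expected a top-level op_builder at {target}")
    if os.path.exists(link_path):
        return False  # a working link or a real directory is already there
    if os.path.islink(link_path):
        os.remove(link_path)  # dangling link, e.g. left over from installation
    # Use a relative target, matching the shell command in this comment.
    os.symlink(os.path.join("..", "..", "op_builder"), link_path)
    return True
```

Calling `link_op_builder` with the environment's `site-packages` path should have the same effect as running the `ln -s` command by hand, and it is safe to call again because it leaves an existing working entry untouched.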

bingo00 avatar Mar 07 '23 06:03 bingo00

OK, great. The root cause is that the symlink is somehow not working. I have not been able to reproduce this bug on our own machines, but I am testing on different operating systems using Docker. A symlink may not be the right approach here, and we will explore other implementations.
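
As one illustration of what an alternative to the symlink could look like, the sketch below copies the directory into place so that the package no longer depends on link support. This is an assumption made for illustration only and not necessarily what the maintainers adopted; `ensure_subpackage` is a hypothetical helper.

```python
import os
import shutil


def ensure_subpackage(source_dir, dest_dir):
    """Make dest_dir a real copy of source_dir instead of a symlink to it."""
    if os.path.islink(dest_dir):
        os.remove(dest_dir)  # remove the link itself, working or dangling
    elif os.path.isdir(dest_dir):
        shutil.rmtree(dest_dir)  # replace a stale copy
    # Copy real files so imports do not rely on the link resolving.
    shutil.copytree(source_dir, dest_dir)
    return dest_dir
```

The trade-off is duplicated files on disk in exchange for an install layout that behaves the same regardless of how symlinks are handled during packaging.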

FrankLeeeee avatar Mar 07 '23 06:03 FrankLeeeee

Glad to hear it was resolved. Thanks.

binmakeswell avatar Apr 20 '23 07:04 binmakeswell