Jiatong (Julius) Han

It might happen due to a mismatch between your torch and CUDA versions. Could you try reinstalling torch via `conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch`?
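
As a quick sanity check (a minimal sketch, assuming torch imports at all), you can print the CUDA toolkit your torch build was compiled against and confirm the GPU is visible:

```python
import torch

# Version of the CUDA toolkit this torch build was compiled against;
# it should match (or be compatible with) the toolkit installed via conda.
print(torch.__version__, torch.version.cuda)

# False here usually points to a driver/toolkit mismatch or a CPU-only build.
print(torch.cuda.is_available())
```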

Can you initialize another variable as `torch.tensor(enc['input_ids']).reshape(1,-1).cuda()` and pass that into the `generate()` function? This error might be caused by a failed CUDA memory allocation.
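
Something along these lines (a sketch; `tokenizer`, `model`, and `prompt` are hypothetical names standing in for the objects in your script):

```python
import torch

# `tokenizer`, `model`, and `prompt` are placeholders for the objects in your script.
enc = tokenizer(prompt)

# Move the input ids to the GPU explicitly, so a failed CUDA allocation
# surfaces here rather than inside generate().
input_ids = torch.tensor(enc['input_ids']).reshape(1, -1).cuda()

output = model.generate(input_ids)
```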

Hi, PyTorch might not be compatible with your CUDA 11.7 ([source](https://discuss.pytorch.org/t/latest-cuda-toolkit-release-11-7-is-it-compatible-with-pytorch/152824/4)). Could you please downgrade it or switch to another environment?

The GPU memory is not sufficient. Please try a smaller model such as OPT-1.3B and see if it works.
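
If you load the model through Hugging Face Transformers, swapping the checkpoint id is usually enough (a sketch; `facebook/opt-1.3b` is the Hub id for that checkpoint, and the rest of your script is assumed unchanged):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "facebook/opt-1.3b" is the Hugging Face Hub id for the 1.3B OPT checkpoint;
# substitute it for the larger model name used in your script.
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
```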

Can you try following the example [here](https://github.com/hpcaitech/ColossalAI/blob/5e4bced0a3fdcb790cda3811aa445f6691e468b1/examples/language/opt/train_gemini_opt.py#L167) and initializing your model under `ColoInitContext`? The 'colossalai' strategy requires the model's parameters to be `ColoParameter`s.
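
Roughly like this (a minimal sketch based on the linked example; the exact import path can differ between ColossalAI versions, and `build_model()` is a placeholder for your own model construction):

```python
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext

# Construct the model inside ColoInitContext so its parameters are created
# as ColoParameters, which the 'colossalai' strategy expects.
with ColoInitContext(device=get_current_device()):
    model = build_model()  # placeholder for your model constructor
```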

Sure, thanks for raising this issue. I will get to fixing it now.

Please follow the instructions in this [README](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ChatGPT) (most importantly, run `pip install .` in the `applications/ChatGPT` folder).

Torch 2.0 is not supported by ColossalAI yet. Please try downgrading it to 1.x.

Please replace the 'pretrain' argument you passed with one of the options listed [here](https://huggingface.co/docs/transformers/model_doc/bloom#transformers.BloomTokenizerFast) (e.g. 'bigscience/bloom-560m').
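
For instance, you can confirm a checkpoint id is valid by loading the tokenizer directly (a sketch assuming the Transformers library; 'bigscience/bloom-560m' is one of the ids listed in the linked docs):

```python
from transformers import BloomTokenizerFast

# 'bigscience/bloom-560m' is one of the pretrained checkpoints listed in the
# linked Transformers docs; pass the same string as the 'pretrain' argument.
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
```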