FasterTransformer

Failing to compile from master on EC2 instances.

HamidShojanazeri opened this issue • 9 comments

Description

Failing to build FasterTransformer from the master branch locally. The build throws an error about missing NCCL, running on EC2 instances, on both P3 and P4d.

Environment

PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.4
Libc version: glibc-2.27

Python version: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-1072-aws-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] torch==1.11.0+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchvision==0.12.0+cu113
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] torch                     1.11.0+cu113             pypi_0    pypi
[conda] torchaudio                0.11.0+cu113             pypi_0    pypi
[conda] torchvision               0.12.0+cu113             pypi_0    pypi

Error

logs

/usr/bin/ld: cannot find -lnccl
collect2: error: ld returned 1 exit status
src/fastertransformer/th_op/gpt/CMakeFiles/th_gpt.dir/build.make:160: recipe for target 'lib/libth_gpt.so' failed
make[2]: *** [lib/libth_gpt.so] Error 1
CMakeFiles/Makefile2:5082: recipe for target 'src/fastertransformer/th_op/gpt/CMakeFiles/th_gpt.dir/all' failed
make[1]: *** [src/fastertransformer/th_op/gpt/CMakeFiles/th_gpt.dir/all] Error 2
Makefile:90: recipe for target 'all' failed
make: *** [all] Error 2

NCCL version check

(fresh_env) ubuntu@ip-172-31-48-37:~/FT_clean/FasterTransformer/build$  python -c "import torch;print(torch.cuda.nccl.version())"
(2, 10, 3)
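One thing worth noting: `torch.cuda.nccl.version()` reports the NCCL that ships inside the PyTorch wheel, which is bundled with the library and is not visible to the system linker. `ld: cannot find -lnccl` means there is no `libnccl.so` on the linker's search path. A quick check could look like this (a debugging sketch, not a step from the FasterTransformer docs):

```shell
# Check whether a system-wide libnccl is visible to the dynamic linker.
# An empty result explains "cannot find -lnccl" even though PyTorch
# reports an NCCL version of its own.
ldconfig -p | grep -i nccl || echo "no system libnccl found"

# Also check the usual CUDA install location (path is illustrative):
ls /usr/local/cuda/lib64/libnccl* 2>/dev/null || echo "no libnccl under CUDA_HOME"
```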

I wonder if I am missing any step here.

Reproduced Steps

1. conda create -n my-env python=3.8
2. conda activate my-env
3. pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
4. git clone https://github.com/NVIDIA/FasterTransformer.git
5. Add to ~/.bashrc (via vi ~/.bashrc):
   export CUDA_HOME=/usr/local/cuda
   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
   export PATH=$PATH:$CUDA_HOME/bin
   CUDACXX=/usr/local/cuda-11.3/bin/nvcc
6. cd FasterTransformer
7. mkdir -p build
8. cd build
9. cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON ..
10. make
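If no system NCCL exists, one possible fix is to install it and re-run the configure step. This is an assumption, not from the project's docs: the package names below are NVIDIA's Ubuntu packages (they require NVIDIA's CUDA apt repository to be configured), and `CMAKE_LIBRARY_PATH` only helps if the build resolves NCCL through CMake's library search paths.

```shell
# Install the NCCL runtime and development files (hypothetical fix;
# assumes the NVIDIA CUDA apt repo is already set up on this host).
sudo apt-get update
sudo apt-get install -y libnccl2 libnccl-dev

# Re-run configure, optionally telling CMake where to look if NCCL
# lives under a non-default prefix (path is illustrative):
cd FasterTransformer/build
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON \
      -DCMAKE_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu ..
make
```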

HamidShojanazeri avatar May 25 '22 06:05 HamidShojanazeri

Can you try the Docker image we suggest in the documentation?

byshiue avatar May 25 '22 06:05 byshiue

Besides, are you using the main branch? I recall we don't need NCCL for the GPT module.

byshiue avatar May 25 '22 06:05 byshiue

Can you try the Docker image we suggest in the documentation?

I am using the Docker image as a workaround, but I am also looking into building locally.

Besides, are you using the main branch? I recall we don't need NCCL for the GPT module.

Yes, I am using the main branch.

HamidShojanazeri avatar May 25 '22 06:05 HamidShojanazeri

I wonder if the Dockerfile is accessible as open source?

HamidShojanazeri avatar May 25 '22 06:05 HamidShojanazeri

The Docker images we use in the documentation are publicly available on NGC.

byshiue avatar May 25 '22 06:05 byshiue

Can you please point me to the repo where I can access the Dockerfile? It would help to mimic the environment.

HamidShojanazeri avatar May 25 '22 06:05 HamidShojanazeri

We don't have a Dockerfile. We use NGC Docker images such as nvcr.io/nvidia/pytorch:22.03-py3 directly.

byshiue avatar May 25 '22 07:05 byshiue
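For reference, using the NGC image named above looks roughly like this. The image name comes from the comment; the `docker` flags are the usual ones for GPU containers, not taken from the FasterTransformer docs:

```shell
# Pull the NGC PyTorch image and start a GPU-enabled container.
docker pull nvcr.io/nvidia/pytorch:22.03-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.03-py3 bash
# Inside the container, the build steps from the issue apply unchanged:
#   git clone https://github.com/NVIDIA/FasterTransformer.git
#   mkdir -p FasterTransformer/build && cd FasterTransformer/build
#   cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON ..
#   make
```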

Yes, thanks for the clarification. I am using the Docker image; I just thought the Dockerfile might be accessible as open source as well.

I would appreciate any suggestions for debugging.

HamidShojanazeri avatar May 25 '22 07:05 HamidShojanazeri

The Docker image is public on NGC; you can pull it directly.

byshiue avatar May 30 '22 07:05 byshiue

Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problems.

byshiue avatar Sep 06 '22 01:09 byshiue