[BUG] Unable to get report

Open yt6983138 opened this issue 2 years ago • 3 comments

Describe the bug
After running pip install deepspeed, both ds_report and python -m deepspeed.env_report fail with a series of errors.

Log output

aistudio@jupyter-3215566-6055193:~$ ds_report
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/bin/ds_report", line 3, in <module>
    from deepspeed.env_report import cli_main
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/deepspeed/__init__.py", line 10, in <module>
    import torch
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

python -m deepspeed.env_report produces the same output.

To Reproduce
Steps to reproduce the behavior:

  1. pip install deepspeed
  2. ds_report or python -m deepspeed.env_report
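For anyone trying to reproduce this, the two steps above can be scripted so that the exit code and full output of each command are captured together. This is only a minimal sketch, assuming deepspeed is already installed in the active environment; the helper name `run_command` is hypothetical and not part of DeepSpeed.

```python
import subprocess
import sys


def run_command(cmd):
    """Run cmd and return (exit_code, combined stdout/stderr text)."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # keep the traceback with the normal output
            universal_newlines=True,    # text mode; this spelling works on Python 3.7
        )
    except OSError as exc:
        # The executable could not be started, e.g. ds_report is not on PATH.
        return 127, str(exc)
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    commands = [
        ["ds_report"],
        [sys.executable, "-m", "deepspeed.env_report"],
    ]
    for cmd in commands:
        code, output = run_command(cmd)
        print("$", " ".join(cmd), "-> exit code", code)
        print(output)
```

Using `sys.executable` for the second command makes sure the module is run with the same interpreter that pip installed into, which matters in a conda environment like the one in the log.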

Expected behavior
Honestly, I don't know; I've never run it before.

ds_report output
Can't run it.

Screenshots
I believe the log is enough.

System info (please complete the following information):

  • OS: Linux jupyter-3215566-6055193 4.15.0-140-generic #144-Ubuntu SMP Fri Mar 19 14:12:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • 1 machine with 1 V100 GPU
  • Python 3.7
  • I'm running it in a VM because I don't have machine-level access
  • 16 GB RAM, Xeon Gold 6148 CPU (2 cores, because of the VM)
  • I'm running this at https://aistudio.baidu.com/, on a V100 GPU with 16 GB of video memory

Docker context
Nope.

Additional context
Nothing, I guess.

yt6983138, Apr 26 '23 11:04

@yt6983138 - were there any errors with the pip install deepspeed command? Also can you confirm you have torch installed prior to deepspeed?
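One way to check this: the second traceback above fails at `import torch` inside deepspeed/__init__.py, so testing torch on its own separates "torch is missing" from "torch is installed but fails to load". The following is a minimal sketch, not an official DeepSpeed tool; the helper names `is_installed` and `try_import` are hypothetical.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def try_import(module_name):
    """Import the module in a fresh interpreter.

    Returns (ok, last_error_line); last_error_line is "" when nothing
    was written to stderr.
    """
    proc = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; this spelling works on Python 3.7
    )
    lines = proc.stderr.strip().splitlines()
    return proc.returncode == 0, (lines[-1] if lines else "")


if __name__ == "__main__":
    for name in ("torch", "deepspeed"):
        if not is_installed(name):
            print(name, "is not installed")
            continue
        ok, err = try_import(name)
        print(name, "imports cleanly" if ok else "fails to import: " + err)
```

Running the import in a separate interpreter keeps a failed shared-library load from affecting the checking process itself. If torch is reported as installed but fails to import with the same libcublas OSError, the problem lies in the torch/CUDA library setup rather than in DeepSpeed.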

loadams, Apr 28 '23 17:04

> @yt6983138 - were there any errors with the pip install deepspeed command? Also can you confirm you have torch installed prior to deepspeed?

No, there were no errors while installing deepspeed. I'll check on torch after I get home.

yt6983138, Apr 29 '23 06:04

Hi @yt6983138, were you able to check if you had torch installed?

molly-smith, May 12 '23 18:05

Closing. Please reopen if the issue persists.

molly-smith, May 26 '23 18:05