Question about using PaddleX in Docker

Open USER-HFC opened this issue 1 year ago • 1 comments

Problem description

I want to build a custom Docker image. When I build it with the following Dockerfile, the step RUN paddlex --install fails with an error.

File name: Dockerfile

FROM rayproject/ray-ml:2.30.0-py310-gpu

# Configure pip to use the Tsinghua mirror
ENV PIP_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple
ENV VLLM_USE_MODELSCOPE=True

# Configure conda to use the Tsinghua mirror
RUN conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ \
    && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ \
    && conda config --set show_channel_urls yes


# Install paddlepaddle-gpu and PaddleX

RUN  python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

COPY ./pdpd/PaddleX /init/PaddleX
RUN pip install -e /init/PaddleX
RUN paddlex --install

# -----------------------------Base environment END-------------------------------------------------#
USER root

# Install SSL certificates
RUN apt-get update && apt-get install -y ca-certificates && update-ca-certificates

# Use it as the default certificate bundle
ENV REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

# Copy the certificate
COPY minio.crt /usr/local/share/ca-certificates/minio.crt

# Refresh the certificate store
RUN update-ca-certificates

USER ray

Error message:


=> ERROR [18/21] RUN paddlex --install                                                                                   0.4s
------
 > [18/21] RUN paddlex --install:
0.392 Error: Can not import paddle core while this file exists: /home/ray/anaconda3/lib/python3.10/site-packages/paddle/base/libpaddle.so
0.412 Traceback (most recent call last):
0.412   File "/home/ray/anaconda3/bin/paddlex", line 33, in <module>
0.412     sys.exit(load_entry_point('paddlex', 'console_scripts', 'paddlex')())
0.412   File "/home/ray/anaconda3/bin/paddlex", line 25, in importlib_load_entry_point
0.412     return next(matches).load()
0.412   File "/home/ray/anaconda3/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
0.412     module = import_module(match.group('module'))
0.412   File "/home/ray/anaconda3/lib/python3.10/importlib/__init__.py", line 126, in import_module
0.412     return _bootstrap._gcd_import(name[level:], package, level)
0.412   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
0.412   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
0.412   File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
0.412   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
0.412   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
0.412   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
0.412   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
0.412   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
0.412   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
0.412   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
0.412   File "/init/PaddleX/paddlex/__init__.py", line 20, in <module>
0.412     from .modules import build_dataset_checker, build_trainer, build_evaluater, build_predictor
0.412   File "/init/PaddleX/paddlex/modules/__init__.py", line 16, in <module>
0.412     from .base import build_dataset_checker, build_trainer, build_evaluater, build_predictor, create_model, \
0.412   File "/init/PaddleX/paddlex/modules/base/__init__.py", line 18, in <module>
0.413     from .trainer import build_trainer, BaseTrainer, BaseTrainDeamon
0.413   File "/init/PaddleX/paddlex/modules/base/trainer/__init__.py", line 17, in <module>
0.413     from .trainer import build_trainer, BaseTrainer
0.413   File "/init/PaddleX/paddlex/modules/base/trainer/trainer.py", line 19, in <module>
0.413     from ..build_model import build_model
0.413   File "/init/PaddleX/paddlex/modules/base/build_model.py", line 18, in <module>
0.413     from ...utils.device import get_device
0.413   File "/init/PaddleX/paddlex/utils/device.py", line 16, in <module>
0.413     import paddle
0.413   File "/home/ray/anaconda3/lib/python3.10/site-packages/paddle/__init__.py", line 33, in <module>
0.413     from .base import core  # noqa: F401
0.413   File "/home/ray/anaconda3/lib/python3.10/site-packages/paddle/base/__init__.py", line 38, in <module>
0.413     from . import (  # noqa: F401
0.413   File "/home/ray/anaconda3/lib/python3.10/site-packages/paddle/base/backward.py", line 25, in <module>
0.413     from . import core, framework, log_helper, unique_name
0.413   File "/home/ray/anaconda3/lib/python3.10/site-packages/paddle/base/core.py", line 384, in <module>
0.413     raise e
0.413   File "/home/ray/anaconda3/lib/python3.10/site-packages/paddle/base/core.py", line 267, in <module>
0.413     from . import libpaddle
0.413 ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
------
Dockerfile:87
--------------------
  85 |     COPY ./pdpd/PaddleX /init/PaddleX
  86 |     RUN pip install -e /init/PaddleX
  87 | >>> RUN paddlex --install
  88 |     
  89 |     # -----------------------------Base environment END-------------------------------------------------#
--------------------
ERROR: failed to solve: process "/bin/bash -c paddlex --install" did not complete successfully: exit code: 1

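I tried adding a paddle-gpu verification step to the Dockerfile and got the same error: libcuda.so.1 could not be found. After that I tried running paddlex --install manually, which led to the session below, with the commands run by hand inside a container: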

root@xdzl-4090:/dev/data_16T/project/xd_ai/portal/images# docker run --rm -it rayproject/ray-ml:2.30.0-py310-gpu bash

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

(base) ray@eed194f1a782:~$ RUN  python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
bash: RUN: command not found
(base) ray@eed194f1a782:~$ python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
(base) ray@eed194f1a782:~$ python -c "import paddle; paddle.utils.run_check()"
Running verify PaddlePaddle program ... 
I0926 00:56:08.805682    55 program_interpreter.cc:243] New Executor is Running.
W0926 00:56:08.807749    55 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.5, Runtime API Version: 11.8
W0926 00:56:08.808048    55 gpu_resources.cc:164] device: 0, cuDNN Version: 8.7.
I0926 00:56:08.961114    55 interpreter_util.cc:648] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

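I suspect the cause is that no GPU is available while Docker builds the image, whereas a GPU is available inside a running container. But the install operation shouldn't need a GPU at all, should it? Is there a good solution?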
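
One way to check this suspicion is to test directly whether the dynamic linker can load libcuda.so.1, without importing paddle. libcuda.so.1 is shipped with the NVIDIA driver rather than the CUDA toolkit, and the NVIDIA container runtime normally mounts it into a container only at run time, so it would typically be missing during docker build. The following is a minimal sketch; the helper name cuda_driver_available is hypothetical and not part of Paddle or PaddleX.

```python
import ctypes


def cuda_driver_available(lib_name: str = "libcuda.so.1") -> bool:
    """Return True if the dynamic linker can load the given shared library.

    Hypothetical helper for illustration only; it checks that the library
    can be loaded, not that a usable GPU is actually present.
    """
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # Raised when the shared object cannot be found or opened,
        # which matches the ImportError shown in the build log.
        return False
    return True


print("libcuda.so.1 loadable:", cuda_driver_available())
```

Running this once in a RUN step of the Dockerfile and once in a container started with GPU access should show whether the two environments differ in the way suspected above.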

USER-HFC, Sep 26 '24 08:09