
Is it possible to support vGPU?

Open Fruneng opened this issue 10 months ago • 7 comments

like https://github.com/Project-HAMi/HAMi-core

Fruneng avatar Jan 14 '25 01:01 Fruneng

Yes, support for the vGPU API should be possible. Unfortunately, we don't actually have any GPUs that support it to develop and test with. If you have one, I believe the NVML API needs to be annotated correctly: https://docs.nvidia.com/deploy/nvml-api/group__nvmlVirtualGpuQueries.html

The annotations can be found here: https://github.com/kevmo314/scuda/blob/main/codegen/annotations.h

I don't know of a good test case for vGPUs, though. Ideally, a very minimal binary that runs through the APIs would make verification easier.

kevmo314 avatar Jan 14 '25 07:01 kevmo314

Do you have any more complex cases that run? Currently, I can only execute the simplest nvidia-smi command.

build image

docker build . -f Dockerfile.build -t scuda-builder-12.6.0 \
            --build-arg CUDA_VERSION=12.6.0 \
            --build-arg DISTRO_VERSION=22.04 \
            --build-arg OS_DISTRO=ubuntu \
            --build-arg CUDNN_TAG=cudnn

create docker network

docker network create scuda

start server

docker run -it --rm --gpus=all -p 14833:14833  --name scuda-server --network scuda  scuda-builder-12.6.0  /bin/bash -c "./local.sh server"

start client

docker run -it --rm --name scuda-client --network scuda  scuda-builder-12.6.0  /bin/bash 

test nvidia-smi

docker cp $(which nvidia-smi) scuda-client:/home/nvidia-smi

docker exec -it scuda-client /bin/bash -c "SCUDA_SERVER=scuda-server LD_PRELOAD=./libscuda_12.6.so ./nvidia-smi"

>Segfault handler installed.
>Wed Jan 15 01:48:43 2025
>+-----------------------------------------------------------------------------------------+
>| NVIDIA-SMI 560.27                 Driver Version: 560.70         CUDA Version: 12.6     |
>|-----------------------------------------+------------------------+----------------------+
>| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
>| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
>|                                         |                        |               MIG M. |
>|=========================================+========================+======================|
>|   0  Quadro P2000                   On  |   00000000:01:00.0 Off |                  N/A |
>| 44%   29C    P8              5W /   75W | Uninitialized          |      0%      Default |
>|                                         |                        |                  N/A |
>+-----------------------------------------+------------------------+----------------------+
>
>+-----------------------------------------------------------------------------------------+
>| Processes:                                                                              |
>|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
>|        ID   ID                                                               Usage      |
>|=========================================================================================|
>|  No running processes found                                                             |
>+-----------------------------------------------------------------------------------------+

test cuda api (aborted)

(base) ➜  ~ docker exec -it scuda-client /bin/bash -c "nvcc test/cublas_unified.cu -g -o cublas_unified -lcublas -L/usr/local/cuda/lib64"
(base) ➜  ~ docker exec -it scuda-client /bin/bash -c "SCUDA_SERVER=scuda-server LD_PRELOAD=./libscuda_12.6.so cuda-gdb ./cublas_unified"
NVIDIA (R) cuda-gdb 12.6
Portions Copyright (C) 2007-2024 NVIDIA Corporation
Based on GNU gdb 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This CUDA-GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://forums.developer.nvidia.com/c/developer-tools/cuda-developer-tools/cuda-gdb>.
Find the CUDA-GDB manual and other documentation resources online at:
    <https://docs.nvidia.com/cuda/cuda-gdb/index.html>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./cublas_unified...
(cuda-gdb) run
Starting program: /home/cublas_unified
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7eb8d83 in std::__detail::_Mod_range_hashing::operator()(unsigned long, unsigned long) const ()
   from ./libscuda_12.6.so
(cuda-gdb) bt
#0  0x00007ffff7eb8d83 in std::__detail::_Mod_range_hashing::operator()(unsigned long, unsigned long) const ()
   from ./libscuda_12.6.so
#1  0x00007ffff7f85170 in std::__detail::_Hash_code_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, void*>, std::__detail::_Select1st, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, true>::_M_bucket_index(unsigned long, unsigned long) const () from ./libscuda_12.6.so
#2  0x00007ffff7f84e81 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, void*>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, void*> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_bucket_index(unsigned long) const () from ./libscuda_12.6.so
#3  0x00007ffff7f84b87 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, void*>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, void*> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from ./libscuda_12.6.so
#4  0x00007ffff7f8450f in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, void*> > >::find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from ./libscuda_12.6.so
#5  0x00007ffff7f74c26 in get_function_pointer(char const*) () from ./libscuda_12.6.so
#6  0x00007ffff7eb8b94 in dlsym () from ./libscuda_12.6.so
--Type <RET> for more, q to quit, c to continue without paging--
#7  0x00007fffd267c356 in ?? () from /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12
#8  0x00007ffff7fc947e in ?? () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7fc9568 in ?? () from /lib64/ld-linux-x86-64.so.2
#10 0x00007ffff7fe32ca in ?? () from /lib64/ld-linux-x86-64.so.2
#11 0x0000000000000001 in ?? ()
#12 0x00007fffffffe3ca in ?? ()
#13 0x0000000000000000 in ?? ()
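A note on this backtrace (my reading, not a confirmed diagnosis): `_Mod_range_hashing` is a modulo by the bucket count, and a SIGFPE there usually means the `unordered_map` behind `get_function_pointer` was touched before its constructor ran, i.e. a static initialization order problem. Because scuda interposes `dlsym`, the dynamic loader can call into it while initializing `libcublasLt` (frames #6-#10), before the preloaded library's own globals are constructed, leaving a bucket count of zero. A common fix is the construct-on-first-use idiom (the names here are hypothetical, not scuda's actual identifiers):

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch: instead of a namespace-scope global
//   std::unordered_map<std::string, void*> functions;  // may be used before construction
// wrap the map in a function-local static. C++11 guarantees it is
// initialized exactly once, on first call -- even if that call arrives
// while the dynamic loader is still running other initializers.
static std::unordered_map<std::string, void*>& function_map() {
    static std::unordered_map<std::string, void*> m;  // constructed on first use
    return m;
}

void register_function(const std::string& name, void* fn) {
    function_map()[name] = fn;
}

void* get_function_pointer(const char* name) {
    auto& m = function_map();
    auto it = m.find(name);
    return it == m.end() ? nullptr : it->second;
}
```

With a plain global map, a lookup routed through the interposed `dlsym` during library loading divides by a zero bucket count, which matches the SIGFPE above.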

Fruneng avatar Jan 15 '25 01:01 Fruneng

You can find our test suite here, which covers the cases we've verified currently work: https://github.com/kevmo314/scuda/blob/main/local.sh#L24

We are still working through all the APIs, though; admittedly, this repo gained visibility much faster than we have been able to wire them all up :)

Most of the APIs only require some tweaks in the annotations file, although knowing which tweaks to make is a bit of an art right now. Improved debugging tools are also on the roadmap.

kevmo314 avatar Jan 15 '25 02:01 kevmo314

> Yes, support for the vGPU API should be possible, however unfortunately we don't actually have any GPUs that support it to develop and test with. If you have one, I believe the nvml API needs to be annotated correctly: https://docs.nvidia.com/deploy/nvml-api/group__nvmlVirtualGpuQueries.html

@kevmo314 What I'm referring to by vGPU is not NVIDIA's official MIG device. It's a technology that similarly uses Linux LD_PRELOAD for its implementation, realized by the Project-HAMi/HAMi-core project. It's also very useful for GPU pooling in data centers.

HAMi-core use case:

export LD_PRELOAD=./libvgpu.so
export CUDA_DEVICE_MEMORY_LIMIT=1g
export CUDA_DEVICE_SM_LIMIT=50

nvidia-smi
>| 44%   29C    P8              5W /   75W|      0 MiB /   1024 MiB |      0%      Default |
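HAMi-core reads limits like `CUDA_DEVICE_MEMORY_LIMIT=1g` from the environment inside the preloaded library and caps allocations accordingly. A rough sketch of how such a limiter might parse that value (the helper name is mine, not HAMi-core's actual code):

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical helper: parse strings like "1g", "512m", "1048576" into
// a byte count. Returns 0 on malformed input.
uint64_t parse_mem_limit(const std::string& s) {
    size_t pos = 0;
    uint64_t value = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
        value = value * 10 + (s[pos] - '0');
        ++pos;
    }
    if (pos == 0) return 0;              // no leading digits
    if (pos == s.size()) return value;   // plain byte count
    if (pos + 1 != s.size()) return 0;   // at most one suffix character
    switch (std::tolower(static_cast<unsigned char>(s[pos]))) {
        case 'k': return value << 10;
        case 'm': return value << 20;
        case 'g': return value << 30;
        default:  return 0;
    }
}
```

A limiter would read `getenv("CUDA_DEVICE_MEMORY_LIMIT")` once at startup and reject memory allocation calls that would exceed the parsed byte count, which is why nvidia-smi above reports a 1024 MiB total instead of the card's real capacity.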

Fruneng avatar Jan 16 '25 02:01 Fruneng

@Fruneng I also have this need. I am thinking about how to integrate scuda with HAMi-core so that GPUs gain pooling capabilities.

silenceli avatar Jan 16 '25 07:01 silenceli

@silenceli Great! Let's discuss how to implement it.

Fruneng avatar Jan 16 '25 08:01 Fruneng

> @silenceli Great! Let's discuss how to implement it.

Feel free to add me on WeChat to chat :-) WeChat ID: silenceli_1988

silenceli avatar Jan 17 '25 01:01 silenceli