cailun01
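> Hello, why doesn't the Count() function in SyncQueue.hpp need a lock? I think the author made a mistake. Count also needs to take the lock. In addition, Count duplicates what Size already does.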
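To illustrate the point about locking, here is a minimal sketch of a bounded synchronized queue in which both `Size()` and `Count()` read the container under the mutex. This is not the original SyncQueue.hpp: only the names `SyncQueue`, `Count` and `Size` come from the thread, and every other member (`Put`, `Take`, `m_mutex`, and so on) is an assumption made for illustration.

```cpp
#include <condition_variable>
#include <cstddef>
#include <list>
#include <mutex>
#include <utility>

// Hypothetical, simplified stand-in for the SyncQueue discussed above.
template <typename T>
class SyncQueue {
public:
    explicit SyncQueue(std::size_t maxSize) : m_maxSize(maxSize) {}

    // Blocks while the queue is full, then appends the value.
    void Put(T value) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notFull.wait(lock, [this] { return m_queue.size() < m_maxSize; });
        m_queue.push_back(std::move(value));
        m_notEmpty.notify_one();
    }

    // Blocks while the queue is empty, then removes and returns the front.
    T Take() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notEmpty.wait(lock, [this] { return !m_queue.empty(); });
        T value = std::move(m_queue.front());
        m_queue.pop_front();
        m_notFull.notify_one();
        return value;
    }

    // Reads the element count while holding the lock, so the read cannot
    // race with a concurrent Put or Take.
    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_queue.size();
    }

    // Same meaning as Size(); forwarding keeps the locking in one place.
    std::size_t Count() const { return Size(); }

private:
    mutable std::mutex m_mutex;  // mutable so const readers can lock it
    std::condition_variable m_notEmpty;
    std::condition_variable m_notFull;
    std::list<T> m_queue;
    std::size_t m_maxSize;
};
```

Reading the container's size without the lock while another thread modifies it is a data race, which is why the sketch locks in `Size()` and lets `Count()` forward to it instead of keeping two separate implementations.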
> 1. Check LD_LIBRARY_PATH > 2. Doesn't support nvidia-docker, only runc > Please confirm the two questions 1. LD_LIBRARY_PATH does include the directory that contains libnvidia-ml.so. Output of `echo $LD_LIBRARY_PATH`: ``` /usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs:/usr/local/nvidia/lib ``` `libnvidia-ml.so` really is in the paths above, but nvidia-smi still cannot find it. Output of `find / -name libnvidia-ml.so`: ``` /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so...
> > > 1. Check LD_LIBRARY_PATH > > > 2. Doesn't support nvidia-docker, only runc > > > Please confirm the two questions > > > > > > >...
> Is there any dead symbol link which named `libnvidia-ml.so` or `libnvidia-ml.so.1` in your image? If so, remove them. Neither `find . -xtype l` nor `symlinks .` found any broken symbolic links. I did notice one thing: on the host, searching the driver installation directory `/usr/lib64` for `libnvidia-ml` shows 2 links (`libnvidia-ml.so`, `libnvidia-ml.so.1`) and 1 shared library (`libnvidia-ml.so.450.66`): ``` lrwxrwxrwx 1 root...
Also, there is a workaround: loading the shared library manually through the `LD_PRELOAD` environment variable solves the problem: `# LD_PRELOAD=/usr/local/nvidia/lib64/libnvidia-ml.so nvidia-smi` ``` Thu Jun 17 06:46:25 2021 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 450.66 Driver Version: 450.66 CUDA Version: N/A | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile...
> nvidia-smi try to dlopen `libnvidia-ml.so.1`, What's the version of your gpu-manager? I pulled the code from the master branch, and the gpu-manager image it built is 1.1.4: ``` REPOSITORY TAG IMAGE ID CREATED SIZE tkestack/gpu-manager 1.1.4 0a74a803da06 2 days ago 10.8 GB ```
> Please provides logs that contains `Mirror %s to %s` and `Vcuda %s Which logs are these, and how do I get them?
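> You could check the `ldconfig` value in the file `/etc/nvidia-container-runtime/config.toml`. The default is `"@/sbin/ldconfig"`; could you try changing it to `"/sbin/ldconfig"`? Thanks for the reply! GPU manager does not use nvidia-docker, so checking /etc/nvidia-container-runtime/config.toml has no effect here.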
I see, thanks! So spawning a new child process to run the mps_model_handler function, with memory shared with output_handler, should solve it, right?
Thanks!