fixed_podman_build

Open Greyyy-HJC opened this issue 2 years ago • 10 comments

Previously, when I tried the original podman build, it failed because the GPU driver could not be recognized inside the container.

In this change, I added another folder that includes a new Dockerfile and instructions for building with a container.

Thank you for the good project. I hope you can check the change and accept it.

Cheers, Jinchen

Greyyy-HJC avatar Jan 11 '24 22:01 Greyyy-HJC

Hi @Greyyy-HJC, thank you for the contribution! @clarkedavida you use the container frequently, right? Can you double check whether these changes work for you?

lukas-mazur avatar Jan 22 '24 10:01 lukas-mazur

I'm sorry it took me so long to look at this. I only noticed yesterday that this was forwarded to me.

I have followed your instructions so far, and ran into this error:

docker run --name simqcd_container --hooks-dir=/usr/share/containers/oci/hooks.d/ --runtime=nvidia -it greyyyhjc/simqcd_cuda_11.2
unknown flag: --hooks-dir
See 'docker run --help'.

Does the command need to be updated?

clarkedavida avatar Jun 05 '24 23:06 clarkedavida

Hi Clarke,

I hope you are doing well; I had almost forgotten about this pull request. The error you sent seems to be caused by the newer version of Docker (19.03 or later). You can try the command below instead.

docker run --name simqcd_container --gpus all -it greyyyhjc/simqcd_cuda_11.2
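
For reference, a minimal sketch of how the GPU flag could be chosen from the Docker version. `--gpus` is available from Docker 19.03 onward, while `--hooks-dir` is a Podman option that `docker run` does not provide. The helper name `docker_gpu_flag` is hypothetical and not part of this pull request, and the fallback to `--runtime=nvidia` assumes that nvidia-docker2 is installed on the older setup.

```shell
# Hypothetical helper (not part of SIMULATeQCD): print a GPU flag for
# `docker run`, given a Docker version string such as "19.03.5".
docker_gpu_flag() (
    major=${1%%.*}            # "19.03.5" -> "19"
    rest=${1#*.}              # "19.03.5" -> "03.5"
    minor=${rest%%.*}         # "03.5"    -> "03"
    minor=${minor#0}          # drop one leading zero: "03" -> "3"
    if [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "${minor:-0}" -ge 3 ]; }; then
        echo "--gpus all"           # native GPU support, Docker 19.03 and later
    else
        echo "--runtime=nvidia"     # assumption: nvidia-docker2 runtime is installed
    fi
)

docker_gpu_flag 19.03.5
docker_gpu_flag 18.09.1
```

On a machine with Docker installed, the version string could be taken from `docker version --format '{{.Server.Version}}'` and the printed flag placed where `--gpus all` appears in the command above.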

Best, Jinchen

Greyyy-HJC avatar Jun 06 '24 02:06 Greyyy-HJC

Thanks for your hints Jinchen, I am making good progress now. What is the difference between the From NVIDIA and ready2use builds? Is there a reason we need both?

Also, after following the ready2use instructions, I compiled memManTest and hit the following error while running:

# [2024-06-06 12:48:31] FATAL: A GPU error occured: _rawPointer: Failed to allocate (additional) 1.024e-06 GB of memory on host: no CUDA-capable device is detected ( cudaErrorNoDevice )
terminate called after throwing an instance of 'std::runtime_error'
  what():  A GPU error occured: _rawPointer: Failed to allocate (additional) 1.024e-06 GB of memory on host: no CUDA-capable device is detected ( cudaErrorNoDevice )

I do have an NVIDIA Quadro P500 on this system. Before compiling, I cleared out the build folder and configured with architecture 61, which should be correct for this GPU. I should also mention that if I compile SIMULATeQCD manually on this system, everything works. Any ideas?
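
One check that may help narrow this down: `cudaErrorNoDevice` is what CUDA reports when the driver sees no GPU, and inside a container that typically goes together with missing `/dev/nvidia*` device nodes. The sketch below is only an illustration; `gpu_nodes_visible` is a hypothetical helper, and the directory argument exists only so the check can be tried on a scratch directory.

```shell
# Hypothetical check (not part of SIMULATeQCD): report whether NVIDIA
# device nodes such as nvidia0 exist in a directory (default: /dev).
gpu_nodes_visible() (
    dev_dir=${1:-/dev}
    status=missing
    for node in "$dev_dir"/nvidia[0-9]*; do
        # An unmatched glob stays literal, so test that the path really exists.
        if [ -e "$node" ]; then
            status=visible
        fi
    done
    echo "$status"
    [ "$status" = visible ]   # exit status 0 only when a node was found
)

# Prints "visible" or "missing" for the current environment.
gpu_nodes_visible || true
```

If this prints `missing` inside the container but `visible` on the host, the GPU was most likely not passed through when the container was started.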

clarkedavida avatar Jun 06 '24 16:06 clarkedavida

Hi Clarke,

Glad to see it is working for you. We do not need both; just pick one of the two. I agree that I can make the README clearer.

The difference is that “From NVIDIA” takes the image from NVIDIA as a base (which is smaller) and then builds a new image via the Dockerfile (that I modified), while “ready2use” is an image that has already been built. In short, if you take the “From NVIDIA” route, once docker build succeeds you will get the same image as “ready2use”.

So “From NVIDIA” means pulling a smaller image but building via the Dockerfile on your own; “ready2use” means pulling a larger image with no extra build steps.

Best, Jinchen

Greyyy-HJC avatar Jun 06 '24 18:06 Greyyy-HJC

OK, any hints about the error I hit?

clarkedavida avatar Jun 06 '24 20:06 clarkedavida

OK, any hints about the error I hit?

Oh, I missed your error before, sorry. I just checked on my architecture 86 machine, and I can make memManTest successfully. Could you try another machine with a different architecture? I am not sure about memManTest; does it have any requirements on the hardware architecture?

Below is the output that I got.

Best, Jinchen

root@7e4148d81e9b:/buildsimqcd# make memManTest
Scanning dependencies of target memManTest
Building CUDA object CMakeFiles/memManTest.dir/src/testing/main_memManTest.cpp.o
Building CXX object CMakeFiles/memManTest.dir/src/base/gutils.cpp.o
Building CXX object CMakeFiles/memManTest.dir/src/base/memoryManagement.cpp.o
Building CUDA object CMakeFiles/memManTest.dir/src/base/indexer/initGPUIndexer.cpp.o
Building CXX object CMakeFiles/memManTest.dir/src/base/indexer/initCPUIndexer.cpp.o
Building CXX object CMakeFiles/memManTest.dir/src/base/communication/communicationBase_mpi.cpp.o
Building CXX object CMakeFiles/memManTest.dir/src/base/IO/parameterManagement.cpp.o
Building CXX object CMakeFiles/memManTest.dir/src/base/IO/fileWriter.cpp.o
Building CUDA object CMakeFiles/memManTest.dir/src/base/math/random.cpp.o
Building CUDA object CMakeFiles/memManTest.dir/src/gauge/gaugefield_device.cpp.o
Building CUDA object CMakeFiles/memManTest.dir/src/gauge/gaugefield.cpp.o
Building CUDA object CMakeFiles/memManTest.dir/src/gauge/gaugeAction.cpp.o
Building CUDA object CMakeFiles/memManTest.dir/src/base/latticeContainer.cpp.o
Linking CUDA device code CMakeFiles/memManTest.dir/cmake_device_link.o
Linking CXX executable testing/memManTest
Built target memManTest

Greyyy-HJC avatar Jun 06 '24 21:06 Greyyy-HJC

I sent a reply on GitHub; I am not sure if it shows up in email as well.

Best, Jinchen

Greyyy-HJC avatar Jun 07 '24 13:06 Greyyy-HJC

Will I need sudo privileges to use the container? Otherwise I only have my laptop that has a usable GPU.

clarkedavida avatar Jun 10 '24 03:06 clarkedavida

Podman and podman-hpc do not need sudo, and I think that if Docker is installed on the cluster, it should be fine to use it without sudo. As for the GPU settings, they have probably been configured already.

Best, Jinchen

Greyyy-HJC avatar Jun 10 '24 03:06 Greyyy-HJC