How to generate prototext?
I am trying to run the sample provided in the repo, https://github.com/LLNL/lbann/blob/develop/applications/vision/lenet.py, using the command: mpiexec lbann --model=lenet.prototext --reader=https://github.com/LLNL/lbann/tree/develop/applications/vision/data/mnist/data_reader.prototext.
Now I want to generate lenet.prototext from the given lenet.py. Is this possible, or am I missing something here? I just want to train the provided LeNet on the MNIST dataset.
If I just try python3 lenet.py, I get the error: RuntimeError: could not detect job scheduler.
Try replacing the lbann.contrib.launcher.run call with lbann.proto.save_prototext:
https://github.com/LLNL/lbann/blob/9c94701e30b83a76c252e1a0b4df97b2b7d11021/python/lbann/proto.py#L7
Something like:
lbann.proto.save_prototext(prototext_file,
                           trainer=trainer,
                           model=model,
                           data_reader=data_reader,
                           optimizer=opt)
The Python frontend assumes you are running LBANN on a system that uses SLURM or LSF job managers. We should add a fallback for MPI.
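In the meantime, here is a rough, untested sketch of a workaround: try the launcher and fall back to writing the prototext when no scheduler is found. It assumes trainer, model, data_reader, and opt are the objects already built in lenet.py, and the output filename exp.prototext is just an example.

# Sketch of a fallback for machines without a supported job scheduler.
# Assumes trainer, model, data_reader, and opt are defined as in lenet.py.
import lbann
import lbann.contrib.launcher
import lbann.proto

try:
    lbann.contrib.launcher.run(trainer, model, data_reader, opt,
                               job_name='lbann_lenet')
except RuntimeError:
    # No SLURM/LSF scheduler detected: dump the experiment to disk so it
    # can be launched by hand, e.g. mpiexec lbann --prototext=exp.prototext
    lbann.proto.save_prototext('exp.prototext',
                               trainer=trainer,
                               model=model,
                               data_reader=data_reader,
                               optimizer=opt)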
@timmoon10 Yes, that works, and now I am running LBANN as: mpiexec lbann --prototext=exp.prototext. exp.prototext was generated using the call above in the provided lenet.py file.
The training seems to run but I am stuck here for an hour now:
[0] Epoch : stats formated [tr/v/te] iter/epoch = [844/94/157]
global MB = [ 64/ 64/ 64] global last MB = [ 48 / 48 / 16 ]
local MB = [ 64/ 64/ 64] local last MB = [ 48+0/ 48+0/ 16+0 ]
Is this expected, or am I doing something wrong? I am not using a GPU in this case, but there is no progress bar of any sort, so I am not sure whether the model is training at all.
An hour seems really excessive for LeNet. I suspect something is hanging. It's odd, since it should just run with one MPI rank if you don't pass in extra arguments.
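For what it's worth, the iteration counts in your log look right for MNIST at batch size 64: 843 × 64 + 48 = 54,000 training samples and 93 × 64 + 48 = 6,000 validation samples (a 90/10 split of the 60,000 training images), and 156 × 64 + 16 = 10,000 test samples. So the data reader itself appears to be set up correctly, which points at a hang rather than a configuration error.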
Can you add lbann.CallbackDebug at the following line?
https://github.com/LLNL/lbann/blob/1b1e3198853566f7417a1dd2477d2e6c4217e6e7/applications/vision/lenet.py#L79
This will printf at the beginning and end of every layer. That can give us an idea of what's hanging.
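Something along these lines should do it (a sketch against the current lenet.py; the three existing callbacks are the ones already in the script that print the model description, metrics, and times):

# Callback list in lenet.py, with the debug callback appended.
callbacks = [lbann.CallbackPrintModelDescription(),
             lbann.CallbackPrint(),
             lbann.CallbackTimer(),
             lbann.CallbackDebug()]  # prints at the start and end of every layer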
@timmoon10 @benson31 I noticed that line was already added. I have attached my log, which contains everything that gets printed to the console. I terminated the process because it gets stuck after starting the epoch and takes too long. log.txt
Another question: how do I use a GPU with the LBANN framework? I have now installed lbann+cuda using: spack install lbann+cuda+nccl~al
and I loaded the CUDA modules, but when I run the prototext file it does not seem to use CUDA, since the epochs still take a long time. Is there anything else that needs to be done for the process to run on a GPU?
I don't see the debug callback in the log. At the line I gave you, we configure the model with three callbacks to print the model description, metrics, and times. Can you add a fourth callback (lbann.CallbackDebug) to the list? Also, it looks like you're running three instances of LBANN at the same time, each one running with 1 MPI rank? I don't think it should cause problems (other than mangling the log file), but I'm wondering if something is misconfigured.
When we move on to running with GPUs, can you try building with cuDNN and Aluminum enabled? cuDNN is required for GPU support and Aluminum is highly recommended for GPU communication.
@timmoon10 to run on GPU I am building: spack install [email protected]+cuda+nccl ^[email protected] ^[email protected] ^conduit~fortran ^[email protected]
But after installing and loading the modules using spack load lbann-(packagename) (also loading Aluminum, Hydrogen, CUDA, and cuDNN via Spack), I get a different error:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called after throwing an instance of 'hydrogen::CUDAError'
what(): Assertion
h_check_cuda_error_code__ == cudaSuccess
in function
void hydrogen::gpu::SetDevice(int)
failed!
{
File: /tmp/pbstmp.11449746/jrodmanu/spack-stage/spack-stage-hydrogen-1.4.0-csf4wubjm674hlj6emasox7emmnwbqfl/spack-src/src/hydrogen/device/CUDA.cpp
Line: 91
Mesg: CUDA error detected in command: "cudaSetDevice(device_id)"
Error Code: 3
Error Name: cudaErrorInitializationError
Error Mesg: initialization error
}
I'm not too familiar with the build system and the main developer is on vacation for the rest of the week, but I'll give it a shot. Is your Spack environment or system environment configured correctly? It looks like CUDA is picking up the wrong NVIDIA driver, so maybe you need to add /usr/lib or /usr/lib64 to your LD_LIBRARY_PATH before running LBANN. If that doesn't fix it, I would try getting a simple "hello world" CUDA program to work with the CUDA installation in your Spack environment.
Pinging @benson31 and @bvanessen.
I did, but I am still getting the same error. It might be a version problem; this is what module list shows:
1) xalt/latest 9) lbann-0.101-gcc-8.4.0-3uaigsn
2) gcc-compatibility/8.4.0 10) hydrogen-1.4.0-gcc-8.4.0-smil2g7
3) intel/19.0.5 11) hydrogen-1.4.0-gcc-8.4.0-csf4wub
4) modules/sp2020 12) hydrogen-1.4.0-gcc-8.4.0-cssa7de
5) lbann-0.101-gcc-8.4.0-pdo7mw4 13) openmpi/4.0.3
6) hydrogen-1.4.0-gcc-8.4.0-vhazpqq 14) aluminum-0.4.0-gcc-8.4.0-rrqoi7d
7) aluminum-0.4.0-gcc-8.4.0-vpq3wyz 15) nccl-2.7.8-1-gcc-8.4.0-47lyinw
8) cudnn-8.0.4.30-11.0-linux-x64-gcc-8.4.0-n2fy4nf 16) cuda/10.2.89
I have tried experimenting with different versions of LBANN, so there are quite a few modules for it, and I have loaded all of them. Is that causing a problem?
I tested a sample "hello world" in CUDA and that works:
#include <cstdio>

// Kernel that prints a message from the device.
__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    cuda_hello<<<1, 1>>>();
    // Wait for the kernel to finish so the device printf is flushed.
    cudaDeviceSynchronize();
    return 0;
}
Your setup looks sensible to me. In my workflow I build the dependencies in a Spack environment and build LBANN with CMake, and I just need to load one modulefile before running LBANN:
. ${spack_root}/share/spack/setup-env.sh
spack env activate -p lbann-dev-power9le
module use ${module_dir}
module load lbann-0.102.0
It's different from your setup, since I'm using a modulefile produced by LBANN rather than by Spack, so I'm not sure how applicable this is.
If we want to wait on debugging the GPU build issues until @benson31 gets back, we can try working out the hang in the non-GPU version instead.