
How to generate prototext?

gitosu67 opened this issue Oct 21 '20 · 9 comments

I am trying to run the sample file provided in the repo, https://github.com/LLNL/lbann/blob/develop/applications/vision/lenet.py, using the command `mpiexec lbann --model=lenet.prototext --reader=https://github.com/LLNL/lbann/tree/develop/applications/vision/data/mnist/data_reader.prototext`. Now I want to generate `lenet.prototext` from the given `lenet.py`. Is this possible, or am I missing something here? I just want to train the provided LeNet on the MNIST dataset.

If I just try `python3 lenet.py`, I get the error: `RuntimeError: could not detect job scheduler`.

gitosu67 avatar Oct 21 '20 04:10 gitosu67

Try replacing the `lbann.contrib.launcher.run` call with `lbann.proto.save_prototext`:

https://github.com/LLNL/lbann/blob/9c94701e30b83a76c252e1a0b4df97b2b7d11021/python/lbann/proto.py#L7

Something like:

lbann.proto.save_prototext(prototext_file,
                           trainer=trainer,
                           model=model,
                           data_reader=data_reader,
                           optimizer=opt)

The Python frontend assumes you are running LBANN on a system that uses the SLURM or LSF job scheduler. We should add a fallback for plain MPI.
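For illustration, the fallback could be sketched like this (a hypothetical example, not LBANN's actual launcher code; the helper name and the commands it checks for are assumptions):

```python
# Sketch only: how a launcher might pick a launch-command prefix and fall
# back to plain MPI when no SLURM/LSF scheduler is found on the system.
import shutil

def detect_launcher():
    """Return a launch-command prefix based on what is available on PATH."""
    if shutil.which('srun'):     # SLURM
        return ['srun']
    if shutil.which('jsrun'):    # LSF
        return ['jsrun']
    if shutil.which('mpiexec'):  # generic MPI fallback
        return ['mpiexec']
    raise RuntimeError('could not detect job scheduler')
```

With a fallback like this, `python3 lenet.py` would at least launch under plain `mpiexec` instead of raising the error above.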

timmoon10 avatar Oct 21 '20 17:10 timmoon10

@timmoon10 Yes, that works, and now I am running LBANN as `mpiexec lbann --prototext=exp.prototext`, where `exp.prototext` was generated by adding the command above to the provided `lenet.py` file.

The training seems to run, but I have been stuck here for an hour now:

[0] Epoch : stats formated [tr/v/te] iter/epoch = [844/94/157]
global MB = [  64/  64/  64] global last MB = [  48  /  48  /  16  ]
local MB = [  64/  64/  64]  local last MB = [  48+0/  48+0/  16+0]

Is this expected, or am I doing something wrong? I am not using a GPU in this case, but there is no progress bar of any sort, so I am not sure whether the model is training or not.

gitosu67 avatar Oct 21 '20 18:10 gitosu67

An hour seems really excessive for LeNet. I suspect something is hanging. It's odd, since it should just run with one MPI rank if you don't pass in extra arguments.

Can you add `lbann.CallbackDebug` at this line?

https://github.com/LLNL/lbann/blob/1b1e3198853566f7417a1dd2477d2e6c4217e6e7/applications/vision/lenet.py#L79

This will print a message at the beginning and end of every layer, which can give us an idea of where it's hanging.
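Concretely, the change might look like this (a sketch against `lenet.py`; the other callbacks shown are illustrative, so match whatever the file already uses):

```python
# Sketch: add the debug callback as a fourth entry in the model's callback
# list in lenet.py. Requires an LBANN installation to run.
callbacks = [lbann.CallbackPrintModelDescription(),
             lbann.CallbackPrint(),
             lbann.CallbackTimer(),
             lbann.CallbackDebug()]  # logs the start and end of every layer
```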

timmoon10 avatar Oct 22 '20 04:10 timmoon10

@timmoon10 @benson31 I noticed that callback was already added at that line. I have attached my log, which contains everything that gets printed to the console. I terminated the process because it gets stuck after starting the epoch and takes a long time to make progress. log.txt

Another question: how do I use a GPU when running the LBANN framework? I installed LBANN with CUDA using `spack install lbann+cuda+nccl~al` and loaded the CUDA modules, but when I run the prototext file it does not seem to detect CUDA, since the epochs still take a long time. Is there anything else that needs to be done for the process to run on a GPU?

gitosu67 avatar Oct 22 '20 13:10 gitosu67

I don't see the debug callback in the log. At the line I gave you, we configure the model with three callbacks to print the model description, metrics, and times. Can you add a fourth callback (lbann.CallbackDebug) to the list? Also, it looks like you're running three instances of LBANN at the same time, each one running with 1 MPI rank? I don't think it should cause problems (other than mangling the log file), but I'm wondering if something is misconfigured.

When we move on to running with GPUs, can you try building with cuDNN and Aluminum enabled? cuDNN is required for GPU support, and Aluminum is highly recommended for GPU communication.

timmoon10 avatar Oct 22 '20 18:10 timmoon10

@timmoon10 To run on a GPU I am building: `spack install [email protected]+cuda+nccl ^[email protected] ^[email protected] ^conduit~fortran ^[email protected]`. But after installing and loading the modules with `spack load lbann-(packagename)` (also loading Aluminum, Hydrogen, CUDA, and cuDNN via Spack), I get a different error:


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called after throwing an instance of 'hydrogen::CUDAError'
  what():  Assertion
    h_check_cuda_error_code__ == cudaSuccess
in function
    void hydrogen::gpu::SetDevice(int)
failed!
{
    File: /tmp/pbstmp.11449746/jrodmanu/spack-stage/spack-stage-hydrogen-1.4.0-csf4wubjm674hlj6emasox7emmnwbqfl/spack-src/src/hydrogen/device/CUDA.cpp
    Line: 91
    Mesg: CUDA error detected in command: "cudaSetDevice(device_id)"
    Error Code: 3
    Error Name: cudaErrorInitializationError
    Error Mesg: initialization error
}

gitosu67 avatar Oct 22 '20 18:10 gitosu67

I'm not too familiar with the build system and the main developer is on vacation for the rest of the week, but I'll give it a shot. Is your Spack environment or system environment configured correctly? It looks like CUDA is picking up the wrong Nvidia driver, so maybe you need to add /usr/lib or /usr/lib64 to your LD_LIBRARY_PATH before running LBANN. If that doesn't fix it, I would try getting a simple "hello world" CUDA program to work with the CUDA installation in your Spack environment.
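One way to check which `libnvidia-ml.so` is being picked up (the paths here are typical defaults, not something I can verify for your system):

```shell
# Prepend the driver's library directories so the GDK stub is not used.
export LD_LIBRARY_PATH=/usr/lib64:/usr/lib:$LD_LIBRARY_PATH

# Confirm which libnvidia-ml.so the binary resolves; it should point at the
# display driver's copy, not a stub under the CUDA toolkit/GDK install.
ldd "$(which lbann)" | grep nvidia-ml
```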

Pinging @benson31 and @bvanessen.

timmoon10 avatar Oct 22 '20 19:10 timmoon10

I did, but I am still getting the same error. It might be a version problem; this is what I see when I run `module list`:

 1) xalt/latest                                       9) lbann-0.101-gcc-8.4.0-3uaigsn
 2) gcc-compatibility/8.4.0                          10) hydrogen-1.4.0-gcc-8.4.0-smil2g7
 3) intel/19.0.5                                     11) hydrogen-1.4.0-gcc-8.4.0-csf4wub
 4) modules/sp2020                                   12) hydrogen-1.4.0-gcc-8.4.0-cssa7de
 5) lbann-0.101-gcc-8.4.0-pdo7mw4                    13) openmpi/4.0.3
 6) hydrogen-1.4.0-gcc-8.4.0-vhazpqq                 14) aluminum-0.4.0-gcc-8.4.0-rrqoi7d
 7) aluminum-0.4.0-gcc-8.4.0-vpq3wyz                 15) nccl-2.7.8-1-gcc-8.4.0-47lyinw
 8) cudnn-8.0.4.30-11.0-linux-x64-gcc-8.4.0-n2fy4nf  16) cuda/10.2.89

I have experimented with different versions of LBANN, so there are quite a few modules for it, and I have loaded all of them. Is that causing a problem?

I tested a sample "hello world" in CUDA, and that works!

#include <cstdio>

__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    cuda_hello<<<1,1>>>();
    cudaDeviceSynchronize();  // wait for the kernel; without this the printf may never appear
    return 0;
}

gitosu67 avatar Oct 22 '20 19:10 gitosu67

Your setup looks sensible to me. In my workflow I build the dependencies in a Spack environment and build LBANN with CMake, and I just need to load one modulefile before running LBANN:

. ${spack_root}/share/spack/setup-env.sh
spack env activate -p lbann-dev-power9le
module use ${module_dir}
module load lbann-0.102.0

It's different from your setup, since I'm using a modulefile produced by LBANN rather than by Spack, so I'm not sure how applicable this is.

If we want to wait on debugging the GPU build issues until @benson31 gets back, we can try working out the hang in the non-GPU version instead.

timmoon10 avatar Oct 22 '20 21:10 timmoon10