VideoGPT

Compute compatibility?

Open slerman12 opened this issue 3 years ago • 6 comments

Would you happen to have a rough estimate of the compute needed to run this model? Unfortunately, we have very limited compute available, and I am getting memory allocation errors when running with the default settings.

Thank you for any support.

slerman12 avatar Jun 10 '21 15:06 slerman12

The model should run on 4 GPUs with ~24GB of memory each. I will change the default batch size in scripts/train_videogpt.py, as it should be something like 4 or 8 (batch size per GPU) to get a total batch size across all GPUs of around 32.
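
For readers sizing this for their own hardware, the relationship between per-GPU and total batch size can be sketched as below. This is only illustrative arithmetic, assuming the total batch is split evenly across GPUs; the function names are hypothetical and are not part of scripts/train_videogpt.py.

```python
def total_batch_size(per_gpu_batch_size: int, num_gpus: int) -> int:
    """Total batch size when every GPU processes the same number of samples."""
    if per_gpu_batch_size <= 0 or num_gpus <= 0:
        raise ValueError("both arguments must be positive")
    return per_gpu_batch_size * num_gpus


def per_gpu_batch_size(total: int, num_gpus: int) -> int:
    """Per-GPU batch size needed to reach a given total batch size."""
    if total <= 0 or num_gpus <= 0:
        raise ValueError("both arguments must be positive")
    if total % num_gpus != 0:
        raise ValueError("total batch size must divide evenly across GPUs")
    return total // num_gpus


if __name__ == "__main__":
    # Figures taken from the reply above: 4 GPUs, per-GPU batch of 4 or 8.
    for per_gpu in (4, 8):
        print(per_gpu, "per GPU on 4 GPUs ->", total_batch_size(per_gpu, 4), "total")
    # Per-GPU batch needed for a total of 32 on 4 GPUs.
    print(per_gpu_batch_size(32, 4))
```

With fewer or smaller GPUs, the per-GPU batch size is the number that has to fit in memory, so lowering it is usually the first thing to try when hitting allocation errors.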

If you haven't tried it yet, I also suggest using sparse attention, as you get some memory usage reduction and speed-up when training the model.

wilson1yan avatar Jun 10 '21 19:06 wilson1yan

Thank you so much! I'll give that a try.

slerman12 avatar Jun 11 '21 14:06 slerman12

Don't want to keep prodding you, but I ran the provided Sparse Attention installation script:

sudo apt-get install llvm-9-dev

And received this trace:

Reading package lists... Done
Building dependency tree    
Reading state information... Done
E: Unable to locate package llvm-9-dev

I tried installing llvm another way:

bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

This worked, but the subsequent DeepSpeed install command did not:

Command errored out with exit status 1

slerman12 avatar Jun 11 '21 17:06 slerman12

Hmm, I'm not too sure what the issue is. Have you tried running sudo apt update or sudo apt-get install before installing llvm-9-dev? This page might also have some useful information.

For the deepspeed install, do you know what the exact error was?

wilson1yan avatar Jun 11 '21 20:06 wilson1yan

The trace is pretty long, but I think it was this:

csrc/sparse_attention/utils.cpp:110:90: warning: narrowing conversion of ‘H’
 from ‘size_t {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
    error: command '/usr/bin/gcc' failed with exit code 1

Maybe our system has some issue with gcc? I'm not too familiar with this system-level stuff.

slerman12 avatar Jun 12 '21 22:06 slerman12

I believe that is essentially the same error you mentioned above ("failed with exit code 1"). The line right above it is just a warning, not the error. The actual error should be somewhere further up in the logs.
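
One way to find the real error in a long build trace is to filter for diagnostic lines that contain "error:" while skipping the final wrapper line that only reports the compiler's exit code. The following is a minimal sketch, assuming GCC-style diagnostics; the function name and the sample log are made up for illustration and are not actual DeepSpeed output.

```python
import re

# Assumption: GCC-style output, where real errors contain "error:".
ERROR_PATTERN = re.compile(r"\berror:")
# The setup wrapper's summary line, which says nothing about the cause.
WRAPPER_PATTERN = re.compile(r"command '.*' failed with exit (code|status)")


def find_error_lines(log_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for likely root-cause error lines."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line) and not WRAPPER_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Illustrative log only: line 2 is a hypothetical error invented for this sketch.
    sample_log = "\n".join([
        "example.cpp:10:5: warning: narrowing conversion [-Wnarrowing]",
        "example.cpp:1:1: error: hypothetical compiler error",
        "error: command '/usr/bin/gcc' failed with exit code 1",
    ])
    for number, line in find_error_lines(sample_log):
        print(f"{number}: {line}")
```

Saving the install output to a file and passing its contents to find_error_lines narrows a long trace down to the lines worth posting in an issue.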

Have you tried looking at some of the GitHub issues on the DeepSpeed repo that might be relevant, such as this one?

One other option is to try out the Dockerfile in the other VideoGPT-related repo.

wilson1yan avatar Jun 15 '21 04:06 wilson1yan