Can't produce result for pre-trained TSN. Takes long time

Open ttran1904 opened this issue 3 years ago • 13 comments

Hi,

I was wondering how long it would take to evaluate a pre-trained TSN model on EK100 data. I ran it for a long time, and it neither stops nor produces a result. I'm using four 12 GB GPUs with these settings:

  • batch_size=1
  • train.num_epoch=0 (no training, just eval)
  • +model.future_predictor.n_head=4
  • +model.future_predictor.n_layer=4
  • +model.future_predictor.inter_dim=64

However, my run never printed anything new after this log: [screenshot: Screen Shot 2022-01-12 at 3 06 16 PM]

How long should a pre-trained model take to just evaluate?

I just want to see how well the pre-trained model works and reproduce the same numbers. But the program just keeps running, and no more logs are produced after this screenshot. I even reduced the batch size, heads, layers, etc. so that it would run faster.

ttran1904 avatar Jan 12 '22 23:01 ttran1904

@ttran1904 I think that you need to use more GPUs and/or increase RAM. Your RAM seems to be getting full.

Anirudh257 avatar Jan 13 '22 12:01 Anirudh257

Do you know what is filling up the RAM? I'm not using the backbone and have loaded the pre-extracted features, because I'm just running a pre-trained model on validation data.

ttran1904 avatar Jan 13 '22 19:01 ttran1904

@ttran1904 Even when you are just evaluating, the model samples frames from the video, which requires the entire video to be loaded into RAM. See here: https://github.com/facebookresearch/AVT/blob/2d6781d5315a4c53bd059b1cd11ee46bd4427648/datasets/base_video_dataset.py#L639

Anirudh257 avatar Jan 13 '22 19:01 Anirudh257

@Anirudh257 I'm confused about why videos are used. Since I downloaded a pre-trained TSN model, the MP4s aren't used at all, right?

ttran1904 avatar Jan 13 '22 21:01 ttran1904

Hi @ttran1904, something seems wrong with the data setup -- the code doesn't find action labels for any of the samples. Was the data set up correctly as described in the README? You are running this on the val set, right?

Also, if you change the heads/layers/intermediate dim size, the pretrained models can't be loaded in, so I'd recommend keeping them fixed if you want to repro the results.

Can you try running the expts/02_ek100_avt_tsn.txt config after adding these top-level config options:

test_only=True
train.init_from_model=[[${cwd}/OUTPUTS/expts/02_ek100_avt_tsn.txt/0/checkpoint.pth]]

(similar to how I test that model on the test set of EK100 in this). This would run testing only, using the pretrained model.
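As a rough illustration of the point above about heads/layers/intermediate dim: a checkpoint can only be loaded into a model whose parameters have the same names and shapes. The sketch below is plain Python and does not use the AVT code; the parameter names and shapes are made up for the example. With PyTorch, the real shapes could be collected with something like `{k: tuple(v.shape) for k, v in model.state_dict().items()}`.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for every checkpoint
    parameter that is missing from the model or has a different shape."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)  # None if the model lacks it
        if model_shape != ckpt_shape:
            mismatches[name] = (ckpt_shape, model_shape)
    return mismatches


# Hypothetical names and shapes, for illustration only.
checkpoint_shapes = {
    "future_predictor.layer0.weight": (2048, 2048),
    "future_predictor.layer5.weight": (2048, 2048),
}
# A model built with a smaller inter_dim and fewer layers.
model_shapes = {
    "future_predictor.layer0.weight": (64, 64),
}

for name, (ckpt, model) in find_shape_mismatches(
        checkpoint_shapes, model_shapes).items():
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```

Any entry reported here means the pretrained weights for that parameter have nowhere to go, which is why keeping the original architecture options matters when reproducing the numbers.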

rohitgirdhar avatar Jan 13 '22 21:01 rohitgirdhar

Hi @rohitgirdhar, I have all the data set up just like in the README file, except for DATA/video/. I didn't download the EK100 MP4 videos because I thought that, since I am only testing, the extracted features alone would be enough. Is that right?

Otherwise, I have all the other components laid out correctly, following the data structure you described in the README file. I also tried adding the 2 lines you mentioned to expts/02_ek100_avt_tsn.txt without changing the heads/layers/intermediate dim size, but that doesn't work either. Maybe the problem is in the data structure.

ttran1904 avatar Jan 15 '22 06:01 ttran1904

Yes, you shouldn't need videos for experiments on pre-extracted features. Can you share the full output log? It is strange that it is not able to read any labels -- that should only happen for test datasets (not val). Just to confirm, you have files inside external/epic-kitchens-100-annotations?

rohitgirdhar avatar Jan 17 '22 15:01 rohitgirdhar

@rohitgirdhar Yes, I have external/epic-kitchens-100-annotations git-cloned already. I hope I am not missing anything... This is my full output file: AVT.log

ttran1904 avatar Jan 17 '22 21:01 ttran1904

I don't see anything strange in the logs (it no longer complains about missing labels either, unlike the screenshot you shared earlier). I see you are running it with submitit, is there a stdout/stderr log? Can you try running it locally (using -l) to see if it prints more information that way?

rohitgirdhar avatar Jan 21 '22 19:01 rohitgirdhar

Yes! I've been running with -l for all runs up till now. So, I recently set everything up again: recreated the environment from the yaml, made sure all the folders/downloads are in place like last time, and refreshed the lab GPU config. I wanted to make sure that it's not a GPU config problem or an installation issue. Then I ran 02_ek100_avt_tsn.txt with this command (with the 2 lines test_only=true ... you mentioned added):

python launch.py -l -c expts/02_ek100_avt_tsn.txt

And I finally got an out of memory error: 🥲

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Here is the log file for this run (I cleared the log file before this run; sorry, I just realized the earlier one was an aggregate of all runs):

new-avt.log

It looks like it loaded the TSN (RGB) model file correctly, but it stops at the last line and prints CUDA error: out of memory on the terminal. Is this a GPU problem now?

I also tried 02_ek100_avt_tsn_test_testonly.txt, and it gives basically the same out of memory error.

ttran1904 avatar Jan 22 '22 05:01 ttran1904

@ttran1904 As I mentioned above, it is a RAM issue. You don't need to load the videos; I was wrong about that earlier. But you may need to clear the RAM before the next epoch or use at least 8 GPUs.

Anirudh257 avatar Jan 22 '22 07:01 Anirudh257

Thank you, @Anirudh257! Hmm... @rohitgirdhar, I was curious why there is a CUDA error: out of memory for a pre-trained TSN. The frames don't need to be loaded, right? It's a pre-trained model (and it uses features only). What's using up so much memory?

ttran1904 avatar Jan 22 '22 19:01 ttran1904

Hi @ttran1904, apologies for the delay in responding. Yes, that should not happen -- the TSN model is fairly lightweight. Can you check what your GPU memory usage is before and after you start running this job? Or, if it's a multi-GPU machine, maybe try running on a different GPU using CUDA_VISIBLE_DEVICES="1" python launch -c ....
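One possible way to do the before/after memory check is sketched below. It assumes `nvidia-smi` is on the PATH and uses its CSV query output; the helper names (`parse_gpu_memory`, `read_gpu_memory`, `pick_freest_gpu`) are made up for this example and are not part of AVT.

```python
import subprocess

# nvidia-smi query that prints one "index, used, total" line per GPU (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse the CSV text into a list of (index, used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def read_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def pick_freest_gpu(rows):
    """Return the index of the GPU with the most free memory."""
    return max(rows, key=lambda row: row[2] - row[1])[0]
```

Calling `read_gpu_memory()` once before launching the job and once while it is running shows whether another process already holds most of a 12 GB card. The index from `pick_freest_gpu(read_gpu_memory())` is the value that would go into `CUDA_VISIBLE_DEVICES`.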

rohitgirdhar avatar Jan 31 '22 14:01 rohitgirdhar