
Initialization of LSTM layers

Open xlliu7 opened this issue 10 years ago • 21 comments

How did you initialize the cell state and the hidden state of the LSTM layers? You gave an equation but didn't explain much. I wonder what the f_init function is; from reading the code, I guess it is a tanh function. How did you do that separately for the 3 layers? Also, what does X refer to? Is it the features of a single sample or of a batch?

xlliu7 avatar Mar 11 '16 05:03 xlliu7

Hi @limingqishi Can you replicate the results in the paper?

frajem avatar Mar 14 '16 13:03 frajem

Hi @frajem I have finished the experiment. I got an accuracy of 94.24% on the UCF11 test set and 39.5% on the HMDB51 split1. How are your results?

xlliu7 avatar Apr 12 '16 01:04 xlliu7

Hi @limingqishi

Have you replicated the experiments of the paper? I'm wondering if you improved on the accuracy reported in the paper. The reported UCF-11 accuracies (%) in the paper are:

  • Softmax Regression (full CNN feature cube): 82.37
  • Avg pooled LSTM (@ 30 fps): 82.56
  • Max pooled LSTM (@ 30 fps): 81.60
  • Soft attention model (@ 30 fps, λ = 0): 84.96
  • Soft attention model (@ 30 fps, λ = 1): 83.52
  • Soft attention model (@ 30 fps, λ = 10): 81.44

I'm working on replicating the results, but I'm having lots of trouble.

GerardoHH avatar Apr 12 '16 22:04 GerardoHH

Hi @GerardoHH @kracwarlock I randomly selected samples for training and testing on the UCF11 dataset. I noticed that many samples in UCF11 actually come from the same long video, so the training set and test set can be very similar, and overfitting can lead to high accuracy on the test set. I think it would be better to experiment on HMDB51, which includes train-test split files. I have only tested the soft attention model (@30fps, lambda=0) so far, and I can't replicate the reported HMDB51 accuracy; my best result so far is 39.5%. I wonder if I preprocessed the data improperly.

xlliu7 avatar Apr 13 '16 01:04 xlliu7

@limingqishi @GerardoHH I had a query: how can we test this code on the UCF-11 dataset? When I downloaded the dataset from http://crcv.ucf.edu/data/UCF_YouTube_Action.php there were no .h5 files, so can you please help me run this code? Also, what kind of computer specs (RAM, OS) are required to run it? Please do respond.

rishabh135 avatar Apr 13 '16 05:04 rishabh135

Hi @rishabh135 h5 is the filename extension for HDF5, a database format designed to store and organize large amounts of data. You need to create the .h5 files yourself. This link may help you. Good luck!
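For anyone unfamiliar with HDF5, here is a minimal h5py sketch of writing and reading back an .h5 file. The dataset names ("features", "labels"), shapes, and filename below are purely illustrative, not the exact layout this repo expects:

```python
import numpy as np
import h5py

# Illustrative only: names and shapes are made up, not this repo's format.
features = np.random.rand(30, 49, 1024).astype("float32")  # one clip's CNN feature cubes
label = np.array([3], dtype="int64")                       # its class index

with h5py.File("ucf11_example.h5", "w") as f:
    f.create_dataset("features", data=features)
    f.create_dataset("labels", data=label)

with h5py.File("ucf11_example.h5", "r") as f:
    shape = f["features"].shape
print(shape)  # (30, 49, 1024)
```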

xlliu7 avatar Apr 13 '16 10:04 xlliu7

@limingqishi can you please also help with what kind of computer would be sufficient to run this code? I have 4GB of RAM and a 2GB Nvidia GeForce 820M graphics card on Windows 7; will that suffice? Reading other answers, I saw that it has previously been run with 48GB of RAM, so will I be unable to run this on my PC?

rishabh135 avatar Apr 13 '16 11:04 rishabh135

@rishabh135 if you don't have enough RAM, you should use h5create and h5write to build the .h5 files incrementally. In that case, I guess 4GB of RAM is enough. I ran the code on a server with 3GB of GPU memory. I think 2GB of memory is okay, because memory usage is below 2GB most of the time. You can try reducing the batch size if it doesn't work. But the 820M card might be slow; if possible, use a server.
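The same incremental-write idea in Python, as an h5py sketch: pre-allocate the dataset on disk and write one video's features at a time, so the full array never has to fit in RAM. All names and shapes here are illustrative assumptions:

```python
import numpy as np
import h5py

# Sketch only: shapes and dataset name are made up for illustration.
n_videos, timesteps, regions, dim = 4, 30, 49, 1024
with h5py.File("features_chunked.h5", "w") as f:
    dset = f.create_dataset(
        "features",
        shape=(n_videos, timesteps, regions, dim),
        dtype="float32",
        chunks=(1, timesteps, regions, dim),  # one video per chunk
    )
    for i in range(n_videos):
        # in practice: run the CNN on video i's frames here
        dset[i] = np.random.rand(timesteps, regions, dim).astype("float32")

with h5py.File("features_chunked.h5", "r") as f:
    total = f["features"].shape
print(total)  # (4, 30, 49, 1024)
```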

xlliu7 avatar Apr 13 '16 11:04 xlliu7

@rishabh135 @limingqishi I have a laptop with 16GB of RAM and a 980M GPU with 4GB of memory, and I had to reduce the test batch size from 256 to 128 because of an out-of-memory error. The script scripts.evaluate_ucf11 takes about 2-3 hours to complete. And I'm sure that my preprocessing (the h5 file) is wrong.

GerardoHH avatar Apr 13 '16 16:04 GerardoHH

@GerardoHH @kracwarlock @limingqishi I am facing an issue while running the script: I get a "No GPU board available" error. Any idea what is causing this? Also, I am not entirely clear on what features I have to extract from the videos (SIFT, HOG, SURF) and then stack to get the .h5 file. Any help with this would be tremendously useful.

rishabh135 avatar Apr 13 '16 17:04 rishabh135

Hi @limingqishi I am training and testing on the UCF11 dataset. It takes over a day to run one training epoch, and I do not know why it is so slow. Can you share your running time and settings with me?

My running environment is:

System: Ubuntu 12.04.5 LTS
GPU: Tesla K10.G2 8GB

My Theano config file .theanorc is:

[global]
floatX = float32
device = gpu
optimizer = fast_run

[lib]
cnmem = 0

[dnn]
enabled = True

[nvcc]
fastmath = True

[blas]
ldflags = -L/home/anaconda/lib -L/usr/lib -lf77blas -lcblas -latlas

kyuusaku avatar Apr 14 '16 13:04 kyuusaku

Hi everyone. I am sorry for all the delay. I was very busy with my thesis and graduation. I am no longer at the University of Toronto but will try to reply regularly here.

kracwarlock avatar Apr 17 '16 21:04 kracwarlock

@limingqishi The cell state and hidden state initialization happens in these lines: https://github.com/kracwarlock/action-recognition-visual-attention/blob/master/src/actrec.py#L368-L369 It is basically done by:

  • the features for a batch have shape (number of timesteps, batch size, 7x7, 1024)
  • you take the mean along dimensions 0 and 2 to get a mean context of shape (batch size, 1024)
  • this is fed into a dense layer with the same size as the states and a tanh activation

I see that I did not release the multi-layer LSTM code. I will try to do that as soon as I have time. Until then, this is how it is done: https://github.com/kracwarlock/arctic-captions/blob/master/capgen.py#L542-L548. In the paper, X denotes the features of a single sample; in the code, everything is done on a batch.
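The initialization described above can be sketched in plain numpy. This is not the repo's actual code; f_init, W_c/b_c, W_h/b_h, and the 512-unit state size are hypothetical stand-ins for the dense layer's parameters:

```python
import numpy as np

# Hypothetical sketch of the state initialization described above.
def f_init(features, W, b):
    # features: (timesteps, batch_size, 49, 1024) CNN feature cubes
    mean_context = features.mean(axis=(0, 2))  # -> (batch_size, 1024)
    # dense layer with tanh activation gives the initial state
    return np.tanh(mean_context @ W + b)       # -> (batch_size, lstm_dim)

rng = np.random.default_rng(0)
feats = rng.standard_normal((30, 4, 49, 1024)).astype("float32")
lstm_dim = 512
W_c, b_c = rng.standard_normal((1024, lstm_dim)) * 0.01, np.zeros(lstm_dim)
W_h, b_h = rng.standard_normal((1024, lstm_dim)) * 0.01, np.zeros(lstm_dim)
c0 = f_init(feats, W_c, b_c)  # initial cell state
h0 = f_init(feats, W_h, b_h)  # initial hidden state (separate weights)
print(c0.shape, h0.shape)  # (4, 512) (4, 512)
```

The tanh keeps the initial states in (-1, 1), matching the range of an LSTM's cell output.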

kracwarlock avatar Apr 17 '16 21:04 kracwarlock

@frajem @limingqishi @GerardoHH Yes UCF-11 has no standard train test split and the accuracy will depend on the split. That's why we didn't report any further results on it. You can overfit and perform very well.

kracwarlock avatar Apr 17 '16 21:04 kracwarlock

@rishabh135 Also take a look at my comments two posts above this one. I will try to make this easier as soon as possible.

kracwarlock avatar Apr 17 '16 21:04 kracwarlock

@rishabh135 https://github.com/kracwarlock/action-recognition-visual-attention/blob/5e3d0ab792195594cd422252cbac3f01333eb7ee/util/README.md#gpu-locking You should remove these lines. The GPU locking code was intended only for University of Toronto ML server users.

kracwarlock avatar Apr 17 '16 21:04 kracwarlock

@limingqishi Did you use a 3-layer LSTM for the experiments on HMDB-51? If not, that would do the trick. If yes, let me know all your hyperparams.

kracwarlock avatar Apr 17 '16 21:04 kracwarlock

@kracwarlock how do we get the .h5 file from the YouTube action dataset videos? Do we need to first extract features ("hog") and then stack them in a matrix? Can anyone please describe a simple program to convert the dataset to .h5 files? Also, what does train_labels.txt contain?

rishabh135 avatar Apr 18 '16 12:04 rishabh135

@rishabh135 If you can ask this on the relevant issue (https://github.com/kracwarlock/action-recognition-visual-attention/issues/6) that would be great. If that issue does not cover your questions please open a separate issue.

kracwarlock avatar Apr 24 '16 03:04 kracwarlock

Hi @kracwarlock! I am also trying to reproduce your results on HMDB-51 and Hollywood2 (after reading this post I think I will skip UCF-11). Can you please share the files valid_labels.txt, train_labels.txt, test_labels.txt, train_filenames.txt, test_filenames.txt and valid_filenames.txt for those two datasets? I would appreciate it a lot :) :) :)

jacopocavazza avatar Oct 03 '16 14:10 jacopocavazza

@jacopocavazza hey, can you open a new issue for that, since this is not related?

kracwarlock avatar Oct 04 '16 05:10 kracwarlock