
Find features, change points, num_frames and positions for custom test video

mayank26saxena opened this issue 5 years ago · 38 comments

Hi @KaiyangZhou,

I wanted to know how I can compute the following quantities to generate a summary for a custom video:

  • Features (for finding seq and probs)
  • Change points (cps)
  • Number of frames (num_frames)
  • Number of frames per seg (nfps)
  • Positions

Please let me know!

mayank26saxena avatar Apr 09 '19 23:04 mayank26saxena

  • Features: you have to extract the features yourself.

    • length: n_frames/15 (the picks values are [0, 15, 30, 45, ..., n_frames] in the SumMe and TVSum datasets)
    • convert the video to frames and extract every 15th frame (see the sketch below).
  • Change points: you have to use KTS. here

    • I'm still trying it...
  • Number of frames: the total number of frames in the video.

  • Number of frames per seg: first you have to get the change points; then you can derive nfps from them.

  • Positions: the indices of the sampled frames (every 15th frame in the SumMe and TVSum datasets).
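
A minimal sketch of that sampling step, assuming OpenCV (the video path is a placeholder, not the repo's exact code):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("my_video.mp4")   # hypothetical path
fps = cap.get(cv2.CAP_PROP_FPS)

frames = []
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(frame)
cap.release()

n_frames = len(frames)
picks = np.arange(0, n_frames, 15)       # sampled positions: [0, 15, 30, ...]
sampled = [frames[i] for i in picks]     # feed these to a CNN to get 'features'
```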

If you've solved it, I'd like to know how you used KTS.

SinDongHwan avatar May 31 '19 05:05 SinDongHwan

Hi,

For change point detection, what should I input to KTS? A flattened H×W image, or features from some feature extraction method so that each frame becomes an N-dimensional vector? What is used in this paper to preprocess the images/frames?

hungbie avatar Jul 30 '19 03:07 hungbie

@hungbie Hi, you should input the features of each frame.
You can see how to use KTS in "utils/generate_dataset.py" at this. A rough sketch of the call is below.
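
A rough sketch of the KTS call, assuming the cpd_auto function from the KTS authors' reference code is on your path; the change-point budget and the segment construction are illustrative, not the repo's exact code:

```python
import numpy as np
from cpd_auto import cpd_auto   # from the KTS authors' reference code (an assumption)

# 'features' is the (n_sampled, D) array of per-frame CNN features
K = np.dot(features, features.T)              # linear kernel between sampled frames
max_cps = features.shape[0] // 10             # loose upper bound on change points (a guess)
cps, _ = cpd_auto(K, max_cps, 1)              # change points in sampled-frame units
cps = cps * 15                                # back to original frames (sampling stride 15)

begins = np.hstack(([0], cps))
ends = np.hstack((cps - 1, [n_frames - 1]))   # n_frames: total frame count of the video
change_points = np.vstack((begins, ends)).T   # one [start, end] row per segment
n_frame_per_seg = ends - begins + 1           # this is 'nfps'
```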

SinDongHwan avatar Jul 30 '19 06:07 SinDongHwan

@SinDongHwan Thank you! I will take a look!

hungbie avatar Jul 30 '19 06:07 hungbie

> @hungbie Hi, you should input the features of each frame. You can see how to use KTS in "utils/generate_dataset.py" at this.

I understand your approach. I have tried it and arrived at the same thing using features from GoogLeNet or ResNet. However, I think in the original KTS paper the authors used SIFT + Fisher vectors to generate the descriptors. Have you tried this method?

hungbie avatar Jul 30 '19 06:07 hungbie

> I understand your approach. I have tried it and arrived at the same thing using features from GoogLeNet or ResNet. However, I think in the original KTS paper the authors used SIFT + Fisher vectors to generate the descriptors. Have you tried this method?

Yes, I tried to use SIFT + Fisher vectors, but I gave up. I know the SIFT + Fisher vector method is just another way to generate features, and I think the authors used it because CNN features were not widely used at the time.
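
For anyone who still wants to attempt it, here is a simplified sketch of a per-frame SIFT + Fisher vector descriptor (means-only FV), assuming OpenCV's SIFT and scikit-learn's GaussianMixture; this is not the KTS authors' exact pipeline, and `frames` is assumed to be the list of sampled frames:

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

sift = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() on older OpenCV builds

def frame_descriptors(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, desc = sift.detectAndCompute(gray, None)
    return desc if desc is not None else np.zeros((1, 128), dtype=np.float32)

# fit a small GMM on descriptors pooled from all frames (subsample in practice)
all_desc = np.vstack([frame_descriptors(f) for f in frames])
gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(all_desc)

def fisher_vector(desc):
    gamma = gmm.predict_proba(desc)                     # (N, K) posteriors
    fv = []
    for k in range(gmm.n_components):
        diff = (desc - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        fv.append((gamma[:, k:k + 1] * diff).sum(0) / (len(desc) * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))              # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)            # L2 normalization

features = np.vstack([fisher_vector(frame_descriptors(f)) for f in frames])
```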

SinDongHwan avatar Jul 30 '19 07:07 SinDongHwan

OK, I will try, but since SIFT is patented anyway, it's good to look into other methods. Thank you!

hungbie avatar Jul 30 '19 08:07 hungbie

@hungbie Okay!! Good Luck^^

SinDongHwan avatar Jul 30 '19 08:07 SinDongHwan

> @hungbie Hi, you should input the features of each frame. You can see how to use KTS in "utils/generate_dataset.py" at this.

Hi @SinDongHwan, I used your code "generate_dataset.py" and found that the feature size is 2048. What can I do?

Harryjun avatar Oct 21 '19 09:10 Harryjun

@Harryjun

I just sent you an email, but I will write the answer here too, for people with the same question.

Hi, Harryjun~!! The provided training dataset was generated using GoogleNet, but my code extracts features using ResNet, so the feature size is 2048. I tested both and got the following results:

1) I extracted features using GoogleNet and computed change points from them.
2) My GoogleNet features gave worse results than the open datasets (TVSum, SumMe, etc.), so I tried ResNet and got better results. ResNet is deeper than GoogleNet, so generating features and change points is a bit slower, but if you use batches you can make it faster (see the sketch below).

If you want 1024-dim features from ResNet, I think an input size of (112, 112) might produce 1024 features. Just my guess. ^^
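
A minimal sketch of batched 2048-dim extraction with ResNet-50, assuming torchvision; `sampled` is the list of sampled frames from the earlier sketch, and the batch size is arbitrary:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = models.resnet50(pretrained=True)
net.fc = torch.nn.Identity()     # drop the classifier; output is the 2048-dim pooled feature
net.eval().to(device)

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

feats = []
with torch.no_grad():
    for i in range(0, len(sampled), 32):   # batches of 32; convert BGR->RGB first if frames come from OpenCV
        batch = torch.stack([preprocess(f) for f in sampled[i:i + 32]]).to(device)
        feats.append(net(batch).cpu())
features = torch.cat(feats).numpy()        # (n_sampled, 2048)
```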

Good Luck!!

SinDongHwan avatar Oct 22 '19 01:10 SinDongHwan

@SinDongHwan Hi, I find that making the h5 file is very slow, like in this image: it processes about one frame per second. Did you have this problem? Could you give me some suggestions?

Harryjun avatar Oct 28 '19 06:10 Harryjun

@Harryjun Hi, I could tell you exactly only if I saw your situation, but I guess you are running out of main memory. When memory runs out, swapping data in and out is slow, so making the h5 file becomes slow too.

Can you check your memory while the code runs? If that is the cause, you can try two methods (just my ideas ^^).

1st method:

"Split the code: extract all features for computing change points, but only every 15th feature for the training dataset."

First, extract all features and then compute the change points.
Second, extract every 15th feature for the training dataset.

2nd method:

"Extract all features once, then select every 15th."

First, extract all features and then compute the change points. Second, select every 15th feature from the extracted features for the training dataset (see the sketch below).
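
The 2nd method is essentially one slice, assuming `features` already holds the per-frame CNN features extracted once:

```python
train_features = features[::15]   # every 15th feature for the training dataset
# change points are still computed from the full 'features' array, as in the KTS sketch above
```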

Do you use Hangouts? I'd like to look at your situation over TeamViewer.

SinDongHwan avatar Oct 28 '19 09:10 SinDongHwan

@SinDongHwan I made some changes in this repo. We could not save so many frames, so we only save the features (training only uses the features) and, at the end, build the summary from frames read back with OpenCV. The reason is that saving all the frames is too expensive. https://github.com/Harryjun/pytorch-vsumm-reinforce

First, make the dataset:

python video_forward2.py --makedatasets --dataset data_our/data_h5/data1.h5 --video-dir data_video/data1/ --frm-dir data_our/frames

Second, compute scores and generate the summary:

python3 video_forward2.py --makescore --model log/summe-split0/model_epoch1000.pth.tar --gpu 0 --dataset data_our/data_h5/data2.h5 --save-dir logs/videolog/ --summary --frm-dir data_our/frames

Harryjun avatar Oct 31 '19 07:10 Harryjun

@Harryjun You're right, you can't save that many frames. When I made the dataset, I tried to save all frames and found it was slow, so I removed the line of code that saves frames.

SinDongHwan avatar Oct 31 '19 08:10 SinDongHwan

@SinDongHwan Hi, I find that the change points I get from KTS are not the same as in the datasets the author provides. What is the reason?

Harryjun avatar Nov 04 '19 04:11 Harryjun

@Harryjun Yes, they're not the same. I haven't solved that either. TT But the results were not bad when using ResNet.

SinDongHwan avatar Nov 05 '19 02:11 SinDongHwan

@SinDongHwan Hi, I want to ask you some questions. Recently we have been working on video keyframe extraction, so I ran a test with DSN and found that it neglects some frames, which is not very good. I would like to ask how to extract key frames from long videos. Can you give me some suggestions? Thank you very much.

Harryjun avatar Nov 05 '19 02:11 Harryjun

@Harryjun How long are the videos? I think you can get good results if you have proper change points. I've read many papers about video summarization, but I'm not a video summarization researcher, just a computer engineer, so I can't suggest a great idea.

I think you can get good results if you read many papers and think hard about how to improve. Good luck~!! You can do it!

SinDongHwan avatar Nov 05 '19 10:11 SinDongHwan

@Harryjun @SinDongHwan My change points differ a lot from the actual H5 file. For example, in the actual H5 file for video 1 the change points are about 100 frames apart, but KTS in "utils/generate_dataset.py" gives me different results, so my network just selects the starting frames when generating the video summary. Can you please help me figure out how to fix the change points?

Swati640 avatar Nov 05 '19 12:11 Swati640

@Swati640, @Harryjun I think you should ask the author of the paper or the creator of the dataset how to get change points similar to those in the dataset.

SinDongHwan avatar Nov 06 '19 01:11 SinDongHwan

@Swati640 @SinDongHwan You can send the author an email to get some suggestions; if you find a solution, please tell me, thanks! Also, two thoughts. First, different networks or parameters will produce different change points. Second, the author first averages the score within every shot (change points [x, y]) and then keeps the highest-scoring shots. I think we could instead build the summary by taking the highest-scoring frames between each pair of change points. For example, with shots [0, 23] and [23, 50], we can pick key frames inside [0, 23] and select 0.15 * 23 of them; that way every shot is considered. You can try it (see the sketch below).
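
A small sketch of that per-shot idea, assuming `scores` is the model's per-frame score array and `change_points` holds [start, end] rows in frame units; this is a greedy per-shot selection, not the repo's knapsack step:

```python
import numpy as np

def summary_from_shots(scores, change_points, proportion=0.15):
    selected = np.zeros(len(scores), dtype=bool)
    for start, end in change_points:
        shot = scores[start:end + 1]
        budget = max(1, int(proportion * len(shot)))   # frames to keep in this shot
        top = np.argsort(shot)[-budget:]               # highest-scoring frames in the shot
        selected[start + top] = True
    return selected
```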

Harryjun avatar Nov 06 '19 02:11 Harryjun

@SinDongHwan @Harryjun If you have used GoogleNet for feature extraction, please let me know how you did the 1024-dim extraction.

Swati640 avatar Nov 11 '19 19:11 Swati640

@Swati640 @Harryjun I've tried to extract features using GoogleNet. I just used code found through a Google search ("googlenet feature extract").

Tell me your email and I will send it to you. But when I tried this, I got bad results.

SinDongHwan avatar Nov 13 '19 02:11 SinDongHwan

I tried as well; I got the dimensions right, but otherwise very bad results for the change points. I would like to compare my code with yours, so that would be really helpful. My email id is [email protected]. Thanks in advance :)

Swati640 avatar Nov 13 '19 02:11 Swati640

@Swati640 I sent you an email. GoogleNet is not in every version of torchvision, so you may have to add and edit code while referring to my email (a sketch is below). Good luck^^
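
A minimal sketch of 1024-dim GoogleNet features, assuming a torchvision version that ships googlenet; otherwise a third-party port is needed. The frame file name is a placeholder:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

net = models.googlenet(pretrained=True)
net.fc = torch.nn.Identity()    # output becomes the 1024-dim pooled feature
net.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("frame_0000.jpg")             # hypothetical frame file
with torch.no_grad():
    feat = net(preprocess(img).unsqueeze(0))   # shape (1, 1024)
```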

SinDongHwan avatar Nov 13 '19 05:11 SinDongHwan

@SinDongHwan @Harryjun Hi~ Thanks for your code. When I run it, I encounter some problems:

File "video_forward2.py", line 236, in module from utils.generate_dataset import Generate_Dataset ImportError: No module named generate_dataset

I don't know how to create a correct dataset from my own video with your code. Could you tell me some details? Thank you again~

harvestlamb avatar Dec 20 '19 08:12 harvestlamb

@harvestlamb Hi~!!

To build a dataset from your video, you should create the following fields (a sketch of writing them to an h5 file follows the list).

  1. 'features' : the feature of every 15th frame
  2. 'picks' : the indices of the sampled frames ([0, 15, 30, ...], i.e. stride 15)
  3. 'n_frames' : the number of frames in the video
  4. 'fps' : frames per second
  5. 'change_points' : shot or scene change points.
    • to get change_points, you should use KTS.
    • Depending on which CNN you use, you will get different change_points. generate_dataset
  6. 'n_frame_per_seg' : the number of frames in the interval between consecutive change points.
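
A sketch of writing these fields with h5py, mirroring the layout of the SumMe/TVSum files; all arrays are assumed to be computed already, and the group name is arbitrary:

```python
import h5py

with h5py.File("my_dataset.h5", "w") as f:
    g = f.create_group("video_1")
    g.create_dataset("features", data=features)              # (n_sampled, D)
    g.create_dataset("picks", data=picks)                    # [0, 15, 30, ...]
    g.create_dataset("n_frames", data=n_frames)
    g.create_dataset("fps", data=fps)
    g.create_dataset("change_points", data=change_points)    # (n_segs, 2) [start, end]
    g.create_dataset("n_frame_per_seg", data=n_frame_per_seg)
```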

If you want to train with supervised learning, I think you also need ground truth ('0/1' labels for every 15th frame) and you have to define your own labeling policy, because there is no single correct ground truth for a summary.

SinDongHwan avatar Dec 21 '19 07:12 SinDongHwan

@SinDongHwan Thank you very much. I tried your code, made my own dataset, and tried to train on it; it creates result.h5 (which only has reward, not an f-score). I encounter this problem:

===> Evaluation
Traceback (most recent call last):
  File "video_summarization.py", line 224, in <module>
    main()
  File "video_summarization.py", line 129, in main
    evaluate(model, dataset, test_keys, use_gpu)
  File "video_summarization.py", line 167, in evaluate
    user_summary = dataset[key]['user_summary'][...]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/oliver/anaconda3/envs/PY2/lib/python2.7/site-packages/h5py/_hl/group.py", line 177, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'user_summary' doesn't exist)"

The evaluation fails. I think it's because I haven't labeled the ground truth ('user_summary', 'gts_score' and 'gtsummary') in my dataset. Should I assign '0/1' labels for these fields for every 15th frame of my videos? Could you give me some guidance on labeling these fields? (I have never labeled a dataset like this.) Thank you again! Best wishes to you~

harvestlamb avatar Dec 21 '19 15:12 harvestlamb

@harvestlamb Hi. I missed that: 'user_summary' is needed for evaluation, but not for testing. 'user_summary' is summary data from n people.

I have never labeled it either. You have to convert your video to frames and then assign '0/1' to all frames, or to every 15th frame. You can refer to the SumMe or TVSum datasets (a small sketch is below).
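
A tiny illustration of the 'user_summary' layout, with made-up numbers: one binary row per annotator, one column per frame.

```python
import numpy as np

n_users, n_frames = 3, 1000
user_summary = np.zeros((n_users, n_frames), dtype=np.float32)
user_summary[0, 100:250] = 1   # user 0 kept frames 100-249
user_summary[1, 120:260] = 1   # user 1 kept a similar but not identical span
user_summary[2, 500:650] = 1   # user 2 chose a different part
# evaluation compares the machine summary against each row
# (max or average f-score, depending on the dataset)
```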

SinDongHwan avatar Dec 22 '19 04:12 SinDongHwan

@SinDongHwan Thank you very much. I analysed the SumMe data following your guidance:

video_1 : <HDF5 dataset "user_summary": shape (15, 4494), type "<f4">
video_10 : <HDF5 dataset "user_summary": shape (15, 9721), type "<f4">
video_11 : <HDF5 dataset "user_summary": shape (15, 1612), type "<f4">
video_12 : <HDF5 dataset "user_summary": shape (15, 950), type "<f4">
video_13 : <HDF5 dataset "user_summary": shape (15, 3187), type "<f4">
video_14 : <HDF5 dataset "user_summary": shape (15, 4608), type "<f4">
video_15 : <HDF5 dataset "user_summary": shape (17, 6096), type "<f4">
video_16 : <HDF5 dataset "user_summary": shape (15, 3065), type "<f4">
video_17 : <HDF5 dataset "user_summary": shape (15, 6683), type "<f4">
video_18 : <HDF5 dataset "user_summary": shape (17, 2221), type "<f4">
video_19 : <HDF5 dataset "user_summary": shape (17, 1751), type "<f4">
video_2 : <HDF5 dataset "user_summary": shape (18, 4729), type "<f4">
video_20 : <HDF5 dataset "user_summary": shape (17, 3863), type "<f4">
video_21 : <HDF5 dataset "user_summary": shape (15, 9672), type "<f4">
video_22 : <HDF5 dataset "user_summary": shape (15, 5178), type "<f4">
video_23 : <HDF5 dataset "user_summary": shape (15, 4382), type "<f4">
video_24 : <HDF5 dataset "user_summary": shape (15, 2574), type "<f4">
video_25 : <HDF5 dataset "user_summary": shape (16, 3120), type "<f4">
video_3 : <HDF5 dataset "user_summary": shape (15, 3341), type "<f4">
video_4 : <HDF5 dataset "user_summary": shape (15, 3064), type "<f4">
video_5 : <HDF5 dataset "user_summary": shape (15, 5131), type "<f4">
video_6 : <HDF5 dataset "user_summary": shape (16, 5075), type "<f4">
video_7 : <HDF5 dataset "user_summary": shape (15, 9046), type "<f4">
video_8 : <HDF5 dataset "user_summary": shape (17, 1286), type "<f4">
video_9 : <HDF5 dataset "user_summary": shape (15, 4971), type "<f4">

So user_summary's shape is (x, y). Obviously y represents n_frames; does x represent the number of people who labeled the video? To make labeling easier, could I reduce the dimensionality, or reuse similar labels across the rows?

harvestlamb avatar Dec 22 '19 14:12 harvestlamb