
[PERFORMANCE_REPORT]+[OPTIMIZATION]/[SUGGESTION]

Open spacewalkingninja opened this issue 1 year ago • 2 comments

Sadly I can't get StableLM to work on a 1070 with 8 GB VRAM and 36 GB RAM. Sad to compile everything on Windows just to see it crash, but hey. Here's a little treat for the authors, since you're the good kind that provides all the models and everything so we can start doing things right away, not like the bad people who don't provide models:

    print("Watching video...")
    data = loadvideo_decord_origin(video_path)
    progress(0.2, desc="Loading Videos")
    print("Step 1/4")
    # InternVideo: sample 8 evenly spaced frames for the action-recognition head
    action_index = np.linspace(0, len(data)-1, 8).astype(int)
    tmp, tmpa = [], []
    for i, img in enumerate(data):
        tmp.append(transform(img).to(device).unsqueeze(0))
        if i in action_index:
            tmpa.append(topil(img))
    action_tensor = trans_action(tmpa)
    TC, H, W = action_tensor.shape
    action_tensor = action_tensor.reshape(1, TC//3, 3, H, W).permute(0, 2, 1, 3, 4).to(device)
    with torch.no_grad():
        prediction = intern_action(action_tensor)
        prediction = F.softmax(prediction, dim=1).flatten()
        prediction = kinetics_classnames[str(int(prediction.argmax()))]
    print("Step 2/4")
    # dense caption: describe every 5th frame in detail
    dense_caption = []
    dense_index = np.arange(0, len(data)-1, 5)
    original_images = data[dense_index, :, :, ::-1]  # sampled frames, channel order reversed for the captioner
    dcs = {}
    with torch.no_grad():
        for original_image in original_images:
            dense_caption.append(dense_caption_model.run_caption_tensor(original_image))
        #dense_caption = ' '.join([f"Second {i+1} : {j}.\n" for i,j in zip(dense_index,dense_caption)])
        for i, j in zip(dense_index, dense_caption):
            key = f"{i+1}"
            value = f"\n View at {i+1} seconds: {j}.\n"
            dcs[key] = value
    print("Step 3/4")  
    # Video Caption
    image = torch.cat(tmp).to(device)   
    model.threshold = 0.68
    if input_tag in ('', 'none', 'None'):
        input_tag_list = None
    else:
        input_tag_list = [input_tag.replace(',', ' | ')]
    with torch.no_grad():
        caption, tag_predict = model.generate(image, tag_input=input_tag_list, max_length=50, return_tag_predict=True)
        print("Step 4/4")
        progress(0.6, desc="Watching Videos")
        #frame_caption = ' '.join([f"Second {i+1}:{j}."+str(dcs.get(str(i+1), ""))+"\n" for i,j in enumerate(caption)])
        frame_caption = ""
        prev_caption = ""
        counter = 1
        for i, j in enumerate(caption):
            current_caption = f"{j}."
            current_dcs = dcs.get(f"{i+1}", "")
            if current_caption == prev_caption:
                frame_caption += f" {current_dcs}"
                counter += 1
            else:
                frame_caption += f"Second {i+1} - "
                frame_caption += f"{i+1+counter}:{current_caption}{current_dcs}"
                prev_caption = current_caption
        if input_tag_list is None:
            tag_1 = set(tag_predict)
            tag_2 = ['none']
        else:
            _, tag_1 = model.generate(image, tag_input=None, max_length=50, return_tag_predict=True)
            tag_2 = set(tag_predict)
        progress(0.8, desc="Understanding Videos")
        
    print("[INFO]" + video_path + " Analyzed")
    print("[TAGS] "+ str( ' | '.join(tag_1) + ' | '.join(tag_2)))
    print(frame_caption)
    #print(frame_caption, dense_caption)

    del data, action_tensor, original_image, image, tmp, tmpa
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    return ' | '.join(tag_1),' | '.join(tag_2), frame_caption, dense_caption, gr.update(interactive = True), prediction

With this, the output that goes to the LLM is better compressed, like:

Mine:

Second 1 - 2:people walking up a hill towards a small plane being carried by a man.
 View at 1 seconds: man wearing gray tshirt and black pants,a woman in a white shirt,a person standing,a person walking on the sand,child wearing a green shirt,a woman in a black shirt,a black horse pulling a cart,a person in the picture,blue and white surfboard,child wearing pink shirt,blue and white plane on top of hill.
  Second 4 - 7:a man flying a blue and white kite over a hill.
 View at 6 seconds: blue kite flying in the air,the grass is tall,a cloudy blue sky,a tree in the grass.
Second 7 - 12:a bald head looking out over a hill at a man flying a kite with wings.

vs Original

 Second 1:people walking up a hill towards a small plane being carried by a man.
 Second 2:people walking up a hill towards a small plane being carried by a man.
 Second 3:people walking up a hill towards a small plane being carried by a man.
 Second 4:a man flying a blue and white kite over a hill.
 Second 5:a man flying a blue and white kite over a hill.
 Second 6:a man flying a blue and white kite over a hill.
 Second 7:a bald head looking out over a hill at a man flying a kite with wings.
 Second 8:a bald head looking out over a hill at a man flying a kite with wings.
 Second 9:a bald head looking out over a hill at a man flying a kite with wings.
Dense output:
 Second 1 : man wearing gray tshirt and black pants,a woman in a white shirt,a person standing,a person walking on the sand,child wearing a green shirt,a woman in a black shirt,a black horse pulling a cart,a person in the picture,blue and white surfboard,child wearing pink shirt,blue and white plane on top of hill.
 Second 6 : blue kite flying in the air,the grass is tall,a cloudy blue sky,a tree in the grass.

It's unnecessary to have all the repeats go to the LLM; it's just spending tokens.
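
For what it's worth, here's the same dedup idea in isolation: a throwaway sketch with made-up captions, using itertools.groupby instead of manual bookkeeping:

    import itertools

    captions = [
        "people walking up a hill.", "people walking up a hill.",
        "people walking up a hill.", "a man flying a kite.",
        "a man flying a kite.", "a bald head looking over a hill.",
    ]

    second = 1
    for text, group in itertools.groupby(captions):
        n = len(list(group))  # length of this run of identical captions
        print(f"Second {second} - {second + n - 1}: {text}")
        second += n

That prints three range lines instead of six per-second lines, which is the whole point.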

It seems to me you aren't using Whisper or anything to handle audio? I've been working on something similar; I use a small local Whisper model and it works fine for getting transcripts.
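
Rough sketch of what I mean, assuming the openai-whisper package (model size and file name are just examples; Whisper extracts the audio track from the video via ffmpeg on its own):

    import whisper  # pip install openai-whisper

    model = whisper.load_model("small")  # "tiny" or "base" for weaker GPUs
    result = model.transcribe("video.mp4")
    for seg in result["segments"]:
        print(f"Second {seg['start']:.0f} - {seg['end']:.0f}: {seg['text'].strip()}")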

spacewalkingninja avatar Apr 29 '23 04:04 spacewalkingninja

        frame_caption = ""
        prev_caption = ""
        start_time = 0
        end_time = 0
        last_valid_dcs = ''
        for i, j in enumerate(caption):
            current_caption = f"{j}."
            current_dcs = dcs.get(f"{i+1}", "")
            if len(current_dcs) > 0:
                last_valid_dcs = current_dcs
            if current_caption == prev_caption:
                end_time = i+1
            else:
                if prev_caption:
                    frame_caption += f"Second {start_time} - {end_time}: {prev_caption}{last_valid_dcs}\n"
                start_time = i+1
                end_time = i+1
                prev_caption = current_caption
        if prev_caption:
            frame_caption += f"Second {start_time} - {end_time}: {prev_caption}{current_dcs}\n"
        total_dur = end_time
        frame_caption += f"| Total Duration: {total_dur} seconds.\n"

Better now. Kiss goodbye, gotta go sleep. Follow @aifredreacts on YT and TikTok for more.

spacewalkingninja avatar Apr 29 '23 05:04 spacewalkingninja

Thank you for your careful analysis and excellent suggestions! We have added support for Whisper in long_video_support, and we will update our code with your suggestions!

yinanhe avatar Apr 29 '23 06:04 yinanhe