Ask-Anything
[PERFORMANCE_REPORT]+[OPTIMIZATION]/[SUGGESTION]
Sadly I can't get StableLM to work on a 1070 with 8 GB VRAM and 36 GB RAM. Sad to compile everything on Windows just to see it crash, but hey. Here's a little treat for the authors, since you're the good kind that provides all the models so we can start doing things right away, unlike the bad people who don't provide models:
```python
# (body of the video-analysis function)
print("Watching video...")
data = loadvideo_decord_origin(video_path)
progress(0.2, desc="Loading Videos")
print("Step 1/4")
# InternVideo
action_index = np.linspace(0, len(data) - 1, 8).astype(int)
tmp, tmpa = [], []
for i, img in enumerate(data):
    tmp.append(transform(img).to(device).unsqueeze(0))
    if i in action_index:
        tmpa.append(topil(img))
action_tensor = trans_action(tmpa)
TC, H, W = action_tensor.shape
action_tensor = action_tensor.reshape(1, TC // 3, 3, H, W).permute(0, 2, 1, 3, 4).to(device)
with torch.no_grad():
    prediction = intern_action(action_tensor)
    prediction = F.softmax(prediction, dim=1).flatten()
    prediction = kinetics_classnames[str(int(prediction.argmax()))]
print("Step 2/4")
# dense caption
dense_caption = []
dense_index = np.arange(0, len(data) - 1, 5)
original_images = data[dense_index, :, :, ::-1]
dcs = {}
with torch.no_grad():
    for original_image in original_images:
        dense_caption.append(dense_caption_model.run_caption_tensor(original_image))
# dense_caption = ' '.join([f"Second {i+1} : {j}.\n" for i,j in zip(dense_index,dense_caption)])
for i, j in zip(dense_index, dense_caption):
    key = f"{i+1}"
    value = f"\n View at {i+1} seconds: {j}.\n"
    dcs[key] = value
print("Step 3/4")
# Video Caption
image = torch.cat(tmp).to(device)
model.threshold = 0.68
if input_tag in ('', 'none', 'None'):
    input_tag_list = None
else:
    input_tag_list = [input_tag.replace(',', ' | ')]
with torch.no_grad():
    caption, tag_predict = model.generate(image, tag_input=input_tag_list, max_length=50, return_tag_predict=True)
print("Step 4/4")
progress(0.6, desc="Watching Videos")
# frame_caption = ' '.join([f"Second {i+1}:{j}."+str(dcs.get(str(i+1), ""))+"\n" for i,j in enumerate(caption)])
frame_caption = ""
prev_caption = ""
counter = 1
for i, j in enumerate(caption):
    current_caption = f"{j}."
    current_dcs = dcs.get(f"{i+1}", "")
    if current_caption == prev_caption:
        frame_caption += f" {current_dcs}"
        counter += 1
    else:
        frame_caption += f"Second {i+1} - {i+1+counter}:{current_caption}{current_dcs}"
        prev_caption = current_caption
if input_tag_list is None:
    tag_1 = set(tag_predict)
    tag_2 = ['none']
else:
    _, tag_1 = model.generate(image, tag_input=None, max_length=50, return_tag_predict=True)
    tag_2 = set(tag_predict)
progress(0.8, desc="Understanding Videos")
print("[INFO] " + video_path + " Analyzed")
print("[TAGS] " + ' | '.join(tag_1) + ' | ' + ' | '.join(tag_2))
print(frame_caption)
# print(frame_caption, dense_caption)
del data, action_tensor, original_image, image, tmp, tmpa
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
return ' | '.join(tag_1), ' | '.join(tag_2), frame_caption, dense_caption, gr.update(interactive=True), prediction
```
With this, the output that goes to the LLM is better compressed, like:
Mine:
Second 1 - 2:people walking up a hill towards a small plane being carried by a man.
View at 1 seconds: man wearing gray tshirt and black pants,a woman in a white shirt,a person standing,a person walking on the sand,child wearing a green shirt,a woman in a black shirt,a black horse pulling a cart,a person in the picture,blue and white surfboard,child wearing pink shirt,blue and white plane on top of hill.
Second 4 - 7:a man flying a blue and white kite over a hill.
View at 6 seconds: blue kite flying in the air,the grass is tall,a cloudy blue sky,a tree in the grass.
Second 7 - 12:a bald head looking out over a hill at a man flying a kite with wings.
vs Original
Second 1:people walking up a hill towards a small plane being carried by a man.
Second 2:people walking up a hill towards a small plane being carried by a man.
Second 3:people walking up a hill towards a small plane being carried by a man.
Second 4:a man flying a blue and white kite over a hill.
Second 5:a man flying a blue and white kite over a hill.
Second 6:a man flying a blue and white kite over a hill.
Second 7:a bald head looking out over a hill at a man flying a kite with wings.
Second 8:a bald head looking out over a hill at a man flying a kite with wings.
Second 9:a bald head looking out over a hill at a man flying a kite with wings.
Dense output:
Second 1 : man wearing gray tshirt and black pants,a woman in a white shirt,a person standing,a person walking on the sand,child wearing a green shirt,a woman in a black shirt,a black horse pulling a cart,a person in the picture,blue and white surfboard,child wearing pink shirt,blue and white plane on top of hill.
Second 6 : blue kite flying in the air,the grass is tall,a cloudy blue sky,a tree in the grass.
It's unnecessary to send all the repeats to the LLM; that's just spending tokens.
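The same run-length compression can also be sketched as a small standalone helper (a hypothetical `compress_captions`, assuming one caption per second and a dense-caption dict keyed by second, as in the code above):

```python
from itertools import groupby

def compress_captions(captions, dcs):
    """Collapse consecutive identical per-second captions into one
    'Second X - Y: ...' line, attaching any dense caption that falls
    inside the run."""
    lines = []
    second = 1
    for text, run in groupby(captions):
        n = len(list(run))
        start, end = second, second + n - 1
        # pick up dense captions recorded for any second in this run
        views = "".join(dcs.get(str(s), "") for s in range(start, end + 1))
        lines.append(f"Second {start} - {end}: {text}.{views}")
        second = end + 1
    return "\n".join(lines)
```

This keeps the repeated-caption detection in one place instead of threading `prev_caption`/`counter` state through the main function body.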
Seems to me you aren't using Whisper or anything to check for audio? I've been working on something similar; I use a small local Whisper model and it works fine for getting transcripts.
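For reference, a minimal sketch of folding a local Whisper transcript into the same per-second layout (`format_transcript` is a hypothetical helper; the segment dicts mirror the `start`/`end`/`text` fields the openai-whisper package returns from `transcribe`):

```python
def format_transcript(segments):
    """Render Whisper-style segments ({'start', 'end', 'text'}) as
    'Audio X - Y: ...' lines, matching the caption layout above."""
    lines = []
    for seg in segments:
        start = int(seg["start"])
        end = int(seg["end"])
        lines.append(f"Audio {start} - {end}: {seg['text'].strip()}")
    return "\n".join(lines)

# Typical usage (not run here; requires `pip install openai-whisper`):
# import whisper
# model = whisper.load_model("small")
# segments = model.transcribe(video_path)["segments"]
# print(format_transcript(segments))
```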
```python
frame_caption = ""
prev_caption = ""
start_time = 0
end_time = 0
last_valid_dcs = ''
for i, j in enumerate(caption):
    current_caption = f"{j}."
    current_dcs = dcs.get(f"{i+1}", "")
    if len(current_dcs) > 0:
        last_valid_dcs = current_dcs
    if current_caption == prev_caption:
        end_time = i + 1
    else:
        if prev_caption:
            frame_caption += f"Second {start_time} - {end_time}: {prev_caption}{last_valid_dcs}\n"
        start_time = i + 1
        end_time = i + 1
        prev_caption = current_caption
# flush the final run (use last_valid_dcs, consistent with the loop above)
if prev_caption:
    frame_caption += f"Second {start_time} - {end_time}: {prev_caption}{last_valid_dcs}\n"
total_dur = end_time
frame_caption += f"| Total Duration: {total_dur} seconds.\n"
```
Better now. Kiss goodbye, gotta go sleep. Follow @aifredreacts on YT and TikTok for more.
Thank you for your careful analysis and excellent suggestions! We have added support for Whisper in long_video_support, and we will update our code with your suggestions!