
[Feature Request] Live Stream Video with Adjustable Prompt in Realtime 🔥

Open · fire17 opened this issue 1 year ago · 1 comment

Hi there! First of all, let me say, this is cutting edge stuff, amazing

Wanted to ask: how can we do this on live video, and what fps should we expect? I'm aiming for realtime 30+ fps, but even 10 fps could work. The idea is to set a prompt (that can be changed dynamically mid-run), and every frame of the video has to respond to that prompt. Let me give you some examples...

[Video stream of a dog's water bowl with a tap directly above it] You have access to an IoT water tap, and your responsibility is to monitor the water level in this container. Make sure it is not overflowing and is not being filled while the container is absent. Trigger a fill action when the level drops below 20% and stop at 90%. Issue a stop action immediately if there's a spill. Your output should be structured as follows: { container_present: {true/false}, container_offcenter: {int in cm / none}, water_level: {int in percent, ranging 0-100%}, action: {idle/fill/stop}, event: {filling/full/empty/spill/dog_drinking} }

  • can stream from a wifi camera and make sure my dogs always have full water 😇
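The water-monitoring prompt above is essentially a per-frame control loop. A minimal sketch of that loop's decision logic, assuming a hypothetical vision-language model returns the structured JSON the prompt requests (the thresholds are the 20%/90% values from the prompt; all names here are illustrative, not from the project):

```python
import json

# Thresholds taken from the example prompt: fill below 20%, stop at 90%.
FILL_BELOW = 20
STOP_AT = 90

def decide_action(state: dict) -> str:
    """Map one frame's structured VLM output to a tap action.

    `state` follows the schema sketched in the prompt:
    container_present, water_level (0-100), event, ...
    """
    # Never fill a spilling or absent bowl, regardless of level.
    if state.get("event") == "spill" or not state.get("container_present", False):
        return "stop"
    level = state["water_level"]
    if level < FILL_BELOW:
        return "fill"
    if level >= STOP_AT:
        return "stop"
    return "idle"

# Example of one frame's structured output, as the prompt requests it.
frame = json.loads('{"container_present": true, "water_level": 15, "event": "empty"}')
print(decide_action(frame))  # -> fill
```

Keeping the safety checks in plain code (rather than trusting the model to emit the right `action`) means a mis-parsed frame can at worst leave the tap idle.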

[Live Video of a tree with fruits on it] {{this prompt can change based on current event and objective status, but for example}} You are remotely controlling an agricultural robot with the capacity to pick fruits. Current objective:

  • Locate the biggest cluster of ripe fruits on the tree in front of us
  • Give directions to the robot to turn slightly left or right based on the cluster's side relative to the center of the frame
  • Give instructions to move forward and approach the cluster, stopping when within 1 meter of the fruits.
  • Make sure the path is free from any obstacles, ropes, or potholes, or navigate around them.
  • Add observation notes. Your output should be structured as follows: { ripe_cluster_size: {int count of ripe fruits in current cluster}, turn: {left/right/center/up/down}, travel: {stop/forward/backward}, objectives_completed: {false/true}, notes: {string with relevant events, information, and agricultural insights} }
  • Generalized agricultural autopilot based on dynamic, general objectives 🔥
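The turn/travel part of the objectives above can be computed deterministically once a detector gives a cluster centroid and a rough distance; only the perception needs the model. A hedged sketch (function and parameter names are made up for illustration; the 1-meter stop distance comes from the objectives):

```python
def steering(cluster_x: float, frame_width: int, distance_m: float,
             tolerance: float = 0.1) -> dict:
    """Turn toward the cluster and stop within 1 m, per the objectives above.

    cluster_x: horizontal pixel position of the ripe-cluster centroid,
    assumed to come from a detector (e.g. something SAM-like plus a
    ripeness classifier). tolerance is the dead zone around center,
    as a fraction of frame width.
    """
    center = frame_width / 2
    offset = (cluster_x - center) / frame_width  # normalized to -0.5 .. 0.5
    if offset < -tolerance:
        turn = "left"
    elif offset > tolerance:
        turn = "right"
    else:
        turn = "center"
    travel = "stop" if distance_m <= 1.0 else "forward"
    return {"turn": turn, "travel": travel}

print(steering(cluster_x=100, frame_width=640, distance_m=2.5))
# -> {'turn': 'left', 'travel': 'forward'}
```

Splitting it this way also makes the dynamic-prompt idea cheaper: the LLM only has to re-plan when the objective text changes, not on every frame.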

[Live video of walking into a grocery shop, walking around and occasionally zooming in on products, their prices, and ingredients] {The Ask-Anything app is open, I'm streaming from my phone and asking different questions along the way, each time referring to something else, using speech-to-text} Hey, what is this? Is it any good? Which other things should I get here if I want to make a sauce for this? How much does that cost? This is my entire cart [goes on listing and showing items], how much do you estimate my final cart price will be at checkout? (Make a bill of all items and sum their total)
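The final "make a bill and sum the total" request is the one piece here that shouldn't be left to the model's arithmetic; once the model has extracted (item, price, quantity) tuples from the stream, the total is plain code. A tiny sketch with made-up example prices:

```python
def cart_total(items) -> float:
    """Sum an itemized bill: items is a list of (name, unit_price, qty)."""
    return round(sum(price * qty for _name, price, qty in items), 2)

# Hypothetical items the model might have extracted from the video stream.
bill = [("tomatoes", 2.49, 2), ("pasta", 1.10, 3), ("basil", 0.99, 1)]
print(cart_total(bill))  # 2.49*2 + 1.10*3 + 0.99*1 = 9.27
```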

Yeah, I hope you see what I mean; really mind-blowing, imo. This could be the next phase shift. Let me know what you think, whether you like it, and how we can make this work?

Thanks a lot and have a good one! All the best! 💜

fire17 avatar May 04 '23 22:05 fire17

Sounds great! We have previously verified that it is possible to give the model perception ability over long videos, and that it has a certain degree of orientation perception.

Using your prompt could let the LLM properly understand what happens in long videos or live video streams! It's going to be a fun and useful application. Maybe we can add a good detector like SAM to our video processing (it shouldn't take too long), and then call the LLM as our control hub for the tasks the user intends to handle.

But we are new to Gradio and not very familiar with many of its functions. At present, we can only caption the video first and then run question answering in the chatbox. We are learning and investigating how to execute these two steps asynchronously!
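Running the two steps asynchronously roughly means: a captioning task keeps consuming frames and pushing captions into a queue, while a Q&A task reads whatever captions have arrived so far. A minimal `asyncio` sketch of that shape (not the project's actual code; `caption_frame` stands in for the real captioning model, and a `None` sentinel marks end of stream):

```python
import asyncio

async def caption_frame(frame: str) -> str:
    """Placeholder for the real (slow) captioning model call."""
    await asyncio.sleep(0)  # yield control, as a real model call would
    return f"caption of {frame}"

async def captioner(frames, queue: asyncio.Queue) -> None:
    """Producer: caption each incoming frame and push it to the queue."""
    for frame in frames:
        queue.put_nowait(await caption_frame(frame))
    queue.put_nowait(None)  # sentinel: stream finished

async def qa_loop(queue: asyncio.Queue, question: str) -> str:
    """Consumer: collect captions as they arrive; an LLM could answer
    from the captions accumulated so far instead of waiting for the end."""
    captions = []
    while (caption := await queue.get()) is not None:
        captions.append(caption)
    return f"{question} -> based on {len(captions)} captions"

async def main() -> str:
    queue = asyncio.Queue()
    _, answer = await asyncio.gather(
        captioner(["frame0", "frame1", "frame2"], queue),
        qa_loop(queue, "What happened?"),
    )
    return answer

print(asyncio.run(main()))  # -> What happened? -> based on 3 captions
```

In a Gradio UI, the producer side would be driven by the incoming stream events, with the chatbox handler playing the consumer role.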

If you have any ideas and interest, we can build this feature together!

yinanhe avatar May 05 '23 04:05 yinanhe