[AI] Add support for Object Detection pipeline
**What does this pull request do? Explain your changes.** (required)

Adds support for the ai-worker object-detection pipeline, which runs the (real-time) RT-DETR object-detection model by default.

Corresponding ai-worker PR: livepeer/ai-worker#243

**Specific updates** (required)

**How did you test each of these updates?** (required)

Testing was performed by running the gateway + worker locally.

**Does this pull request close any open issues?**
**Checklist:**

- [X] Read the contribution guide
- [X] `make` runs successfully
- [ ] All tests in `./test.sh` pass
- [ ] README and other documentation updated
- [ ] Pending changelog updated
cc @rickstaa
Did an initial look through and it looks pretty good. Will update tomorrow when I can pull and run it locally. Thank you for adding tests!
One question: would it make sense to have an option to only return the detections text response? The frame transcoding is done on the CPU, so it will be pretty slow. @rickstaa or @leszko, is looking at NVIDIA transcoding still a bit in the future?
> One question, would it make sense to have an option to only return the detections text response? The transcoding of the frames is CPU transcoding so will be pretty slow. @rickstaa or @leszko looking at doing nvidia transcoding is still a bit in the future right?
I don't have enough context to answer this. @rickstaa may know more.
> One question, would it make sense to have an option to only return the detections text response? The transcoding of the frames is CPU transcoding so will be pretty slow. @rickstaa or @leszko looking at doing nvidia transcoding is still a bit in the future right?
If @rickstaa comments on the future of CPU transcoding, then I can push a commit adding a `labels_only` header to the API to avoid calling `transcodeFrames`.
P.S. If agreed after discussion, the header can be added after performing E2E testing of the pipeline in the PR's current state.
@leszko, @ad-astra-video I haven’t planned to replace the CPU for transcoding yet. I was considering holding off on that until it becomes necessary for the real-time version of this pipeline. I think for now @ad-astra-video's solution makes sense.
@RUFFY-369 I have looked through the code and ran it E2E (with updates below) in docker. You did a good job getting all the remote worker parts together!
Some updates I sent in a PR to your branch, feel free to merge or use as a guide to adjust your branch:
- I updated `ai_http.go` to use `ffmpeg.GetCodecInfoBytes` to get the `outPixels` calculation. Also updated ai-worker to use the LPMS function because ffprobe is not installed in the docker container.
- Updated `core/ai_worker.go` to transcode all the frames to one MP4. It was transcoding each frame individually in separate MP4s I think before the update.
  - I also updated it to try and guess an appropriate bitrate by using the ffmpeg transcode profiles in LPMS. This is not perfect but is better than assigning one bitrate for all resolutions. We can improve this in a future PR.
Please update to add:
- Return the detection data to the user, it is dropped with only the video returning right now.
- Some additional suggestions in the ai-worker PR for some things to add to the data returned.
Questions:
- Looks like similar APIs are charging by the minute, which I think is conceptually easier than pricing based on pixels. Do you think pricing by seconds makes sense for this pipeline? Do you know if inference time changes significantly based on input size? To note, we already pull the input duration from ffmpeg for the `audio-to-text` pipeline, so it should be pretty low lift to change to pricing by the second (or by the millisecond, as the `audio-to-text` pipeline does). cc @rickstaa if you have some thoughts on pricing for this pipeline?
- Do you know if there is a good way to render the boxes client side? In my research, the HTML5 video tag seems to behave differently across browsers when a new frame is displayed or when signaling time updates (Firefox looks like every frame, Chrome/Safari every 200-250 ms). Maybe seeing the detected items is mostly for debugging/confirming the model is working, and most users would just want the detection data. That said, I think it is good to be able to return both if wanted with the batch processing API.
If you want to try it out, my docker builds are `adastravideo/go-livepeer:object-detection` and `adastravideo/ai-runner:object-detection`. Attaching the result of one detection run for reference:
https://github.com/user-attachments/assets/5de6c2c1-705f-4e96-9103-464eb5ec9822
> @RUFFY-369 I have looked through the code and ran it E2E (with updates below) in docker. You did a good job getting all the remote worker parts together!
>
> Some updates I sent in a PR to your branch, feel free to merge or use as a guide to adjust your branch:
>
> - I updated `ai_http.go` to use `ffmpeg.GetCodecInfoBytes` to get the `outPixels` calculation. Also updated ai-worker to use the LPMS function because ffprobe is not installed in the docker container.
> - Updated `core/ai_worker.go` to transcode all the frames to one MP4. It was transcoding each frame individually in separate MP4s I think before the update.
>   - I also updated it to try and guess an appropriate bitrate by using the ffmpeg transcode profiles in LPMS. This is not perfect but is better than assigning one bitrate for all resolutions. We can improve this in a future PR.
Thanks for the PR; getting all frames into one MP4 was also a TODO for me. I have merged those changes already :+1:
> Please update to add:
>
> - Return the detection data to the user, it is dropped with only the video returning right now.

Done in the recent commits. Please have an E2E run on your side for cross-validation :rocket:

> - Some additional suggestions in the ai-worker PR for some things to add to the data returned.
Can you elaborate a little? Are there changes still to be made in the ai-worker repo? I also pushed the required changes to that repo when the go-livepeer changes were introduced.
> Questions:
>
> - Looks like similar APIs are charging by the minute which I think is easier conceptually than pricing based on pixels. Do you think pricing by seconds makes sense on this pipeline? Do you know if inference time changes significantly based on input size? To note, we already have the duration from the input pulling from ffmpeg for the `audio-to-text` pipeline so should be pretty low lift to change to pricing by the second (or millisecond used by the `audio-to-text` pipeline). cc @rickstaa if has some thoughts on pricing of this pipeline?
Regarding the delta in inference time with respect to input size: I did some inference runs on files of various sizes on a T4 GPU in Google Colab. There were five files, of sizes 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB, and 116.1 MB. One thing to note: even if one input video file is slightly larger than another due to a higher frame count, the smaller file will still take longer to infer if its resolution is higher. Other than that, the inference time across these files varied as t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t), respectively.
Could the initial pricing be done similar to SAM2: x USD per input pixel (`height*width*frames`)? Thinking about pricing, we can mainly price either by compute seconds or by model output. For this pipeline I think compute seconds is the appropriate metric; pricing by model output suits generative models, where resolution can be varied from high-quality to low-quality images.
> - Do you know if there is a good way to render the boxes client side? In my research the html5 video tag seems to not be the same across browsers when a new frame is displayed or signaling time updates (firefox looks like every frame, chrome/safari every 200-250ms). Maybe seeing the detected items is mostly for debugging/confirming the model is working and most would just want the detection data. That said, I think is good to be able to return both if wanted with the batch processing API.
Hmm, regarding rendering boxes client-side, I'll need to explore this further as it's an area I haven't worked on deeply yet. Having the choice is the better option, since frames can also be annotated outside the pipeline loop using the detection output data.
Also, @ad-astra-video, could you cross-check on your side with an E2E run of the recent commits that were pushed? Otherwise I have addressed all the requested changes :+1: Thanks
> Can you elaborate a little?! Are there any changes in ai-worker repo which are to be made because I also pushed the required changes on that repo when the go-livepeer changes were introduced.
The `object-detection` route in go-livepeer currently only returns the video from the Orchestrator. The ai-runner returns all the information, but go-livepeer drops the detection data in the `parseMultiPartResult` function, where it converts the result to an `ImageResponse`. I think we should return the `ObjectDetectionResponse`, with the returned video being optional for the user since it is the slower CPU encoding right now.
> Can the initial pricing be done similar to SAM2: x USD per input pixel (`height*width*frames`). If to think about the pricing, mainly we can either do pricing based on the compute seconds or based on the model output. For this pipeline I think based on compute seconds will be an appropriate metric as based on model output would suit for generative models where resolution can be varied from high quality to low quality images.
I think pricing based on pixels most accurately tracks compute difficulty, since it incentivizes users to send in lower-resolution samples (e.g. 720p or lower) to get a better price. That said, other services price inference by video seconds, so that would be easiest for users converting to the Livepeer network. I am fine with leaving pricing per pixel for now to stay similar to the other pipelines. Audio uses pricing based on the input file's duration only because there are no pixels to count.
> Regarding the delta in inference time with respect to the input size, I did some inference runs for files of various sizes on a T4 GPU with google colab. So, there were five files each of size 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB & 116.1 MB. One thing to note was also that even if the size of the input video file is a little bit more than another file due to the increased number of frames, still if the resolution is high for the smaller sized file, the inference time will be more for it. Other than that the inference time varied as follows for all these files: t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t) respectively.
I was not clear in what I was asking, sorry about that. I was curious about the inference time difference between, say, 1080p and 360p. Below are examples of the inference time difference at the two resolutions using the same input video. Inference is a little less than 10% faster at 360p, but decoding is about 800% faster, so in my opinion it should cost less to process.
1080p

```
2024-11-27 16:02:27,543 - app.routes.object_detection - INFO - Decoding video: video size: 3779273
2024-11-27 16:02:30,512 - app.routes.object_detection - INFO - Decoded video in 2.95 seconds
2024-11-27 16:02:43,502 - app.routes.object_detection - INFO - Detections processed in 12.99 seconds
```
360p (note: annotating the frames adds about 1 second to detections time in this 10-second video)

```
2024-11-27 15:56:24,062 - app.routes.object_detection - INFO - Decoded video in 0.37 seconds
2024-11-27 15:56:35,177 - app.routes.object_detection - INFO - Detections processed in 11.12 seconds
2024-11-27 15:56:46,657 - app.routes.object_detection - INFO - Annotated frames converted to data URLs in 11.48 seconds, frame count: 266
2024-11-27 15:56:46,855 INFO: 172.17.0.1:58998 - "POST /object-detection HTTP/1.1" 200 OK
```
> Can you elaborate a little?! Are there any changes in ai-worker repo which are to be made because I also pushed the required changes on that repo when the go-livepeer changes were introduced.
> The `object-detection` route in go-livepeer only returns the video right now from the Orchestrator. The ai-runner returns all the information but go-livepeer is dropping the detection data in the `parseMultiPartResult` function where it is converting it to the `ImageResponse`. I think we should be returning the `ObjectDetectionResponse` with the video returned being optional to the user since it is slower CPU encoding right now.
I changed the `ImageResponse` result to `ObjectDetectionResponse` as the output in past commits after you pointed it out.
> Can the initial pricing be done similar to SAM2: x USD per input pixel (`height*width*frames`). If to think about the pricing, mainly we can either do pricing based on the compute seconds or based on the model output. For this pipeline I think based on compute seconds will be an appropriate metric as based on model output would suit for generative models where resolution can be varied from high quality to low quality images.
> I think pricing based on pixels is the most accurate on compute difficulty since it would incentivize users to send in lower resolution samples to process (eg 720p or lower) to get a better price. That said, other services price the inference based on video seconds so that would be easiest for users to convert to using Livepeer network. I am fine with leaving pricing as per pixel for now to be similar to other pipelines. Audio uses input file time length based pricing only because there is no pixels to count.
I think that pricing should eventually get an update across all the pipelines, using a simple, not overly complex combination of different metrics. But for now I am leaving pricing pixel-based, similar to the other pipelines, like you mentioned :+1:
> Regarding the delta in inference time with respect to the input size, I did some inference runs for files of various sizes on a T4 GPU with google colab. So, there were five files each of size 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB & 116.1 MB. One thing to note was also that even if the size of the input video file is a little bit more than another file due to the increased number of frames, still if the resolution is high for the smaller sized file, the inference time will be more for it. Other than that the inference time varied as follows for all these files: t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t) respectively.
> I was not clear on what I was asking, sorry about that. I was curious about the inference time difference between say 1080p and 360p. Below is the examples of inference time difference at the two resolutions using the same input video. The inference time difference is a little less than 10% faster using 360p but decoding is about 800% faster so should in my opinion cost less to process.
> 1080p
>
> ```
> 2024-11-27 16:02:27,543 - app.routes.object_detection - INFO - Decoding video: video size: 3779273
> 2024-11-27 16:02:30,512 - app.routes.object_detection - INFO - Decoded video in 2.95 seconds
> 2024-11-27 16:02:43,502 - app.routes.object_detection - INFO - Detections processed in 12.99 seconds
> ```
>
> 360p note: annotating the frames adds about 1 second to detections time in this 10 second video
>
> ```
> 2024-11-27 15:56:24,062 - app.routes.object_detection - INFO - Decoded video in 0.37 seconds
> 2024-11-27 15:56:35,177 - app.routes.object_detection - INFO - Detections processed in 11.12 seconds
> 2024-11-27 15:56:46,657 - app.routes.object_detection - INFO - Annotated frames converted to data URLs in 11.48 seconds, frame count: 266
> 2024-11-27 15:56:46,855 INFO: 172.17.0.1:58998 - "POST /object-detection HTTP/1.1" 200 OK
> ```
Thank you for the clarification! I presumed you might also be asking about resolution. As I mentioned in my previous reply, I noticed (across different videos, not the same one) that resolution plays a more important role in total inference time than the overall size (frame resolution x duration) of the input video. That means that to get quick results, users should decrease their video resolution rather than the number of video seconds.
The data you provided gives quite nice insights. Hmm, as previously discussed, annotation can be added as optional functionality in the API.
@ad-astra-video What are the final changes I need to make to get this PR ready for merge? Most of them are addressed, I think.
> @ad-astra-video What are the final changes I need to make to get this PR ready for merge? Most of them are addressed, I think.
@RUFFY-369 I put a PR up to remove the remaining async object detection route code and update go.mod and go.sum.
This PR is in good shape, but the ai-worker PR needs to be completed before merging this one. Some changes will be needed in this PR from the updates requested in the ai-worker PR for sending the re-encoded video back from the runner, but I expect them to be relatively minor.
> > @ad-astra-video What are the final changes I need to make to get this PR ready for merge? Most of them are addressed, I think.
>
> @RUFFY-369 I put a PR up to remove the remaining async object detection route code and update go.mod and go.sum.
>
> This PR is in good shape but the ai-worker PR needs to be completed before merging this PR. There will be some changes needed in this PR from the updates requested in the ai-worker PR for sending the re-encoded video back from the runner but expect them to be relatively minor.
Hi @ad-astra-video, thanks for the PR, I will take a look and get it merged.
The updates you requested in the ai-worker PR comment, right? I will get them done just now.
Let me get both of them done so that you can review the changes, and then let's get this pipeline merged :pray:
@ad-astra-video I have made the requested changes in the ai-worker repo and made the corresponding changes in this PR to support them. You can have a look :+1: :rocket:
@RUFFY-369 I put up a PR on your repo with some changes I used to test end to end. They are relatively small changes, mostly incorporating the changes from the PR I put up on your ai-worker repo.
Can you rebase this to master? Then we can merge!
> @RUFFY-369 I put up a PR on your repo for some changes I used to test end to end. Relatively small changes mostly and some changes to incorporate changes in the PR I put up on your ai-worker repo.
>
> Can you rebase this to master? Then we can merge!
@ad-astra-video I have merged your PR. Thanks!
And also rebased this to master :+1: