
[VideoProcessor API] OffscreenCanvas limitation and API suggestions?

Open Dosant opened this issue 3 years ago • 32 comments

Thanks for adding the VideoProcessor API; it's very handy and convenient to have. I added video background blurring/replacement to my app through this API and everything seems to be working just fine 👍 tweet for the curious

OffscreenCanvas limitation?

Why does the API expose OffscreenCanvas instead of HTMLCanvasElement?

I couldn't find an explanation of this design decision in the docs. Also, the examples you've implemented don't seem to be doing anything with OffscreenCanvas that isn't possible with HTMLCanvasElement.

I understand this is the reason why this API only works in Chrome, but it isn't clear why the API should be limited by using OffscreenCanvas. Would love to understand the bigger picture 🙏

My implementation on top of the current API

As I understand it, the OffscreenCanvas that the API provides was meant to be transferred into a web worker to offload the work to a separate thread. But the API provides an already-locked OffscreenCanvas, so it isn't possible to transfer it as is.

In my case, I ended up calling inputFrame.transferToImageBitmap() and transferring the ImageBitmap into a web worker, where it was drawn into the worker's own OffscreenCanvas instance. Then I did the same to transfer the resulting ImageBitmap back.

This seems suboptimal, and I could do the same if the API provided an HTMLCanvasElement, which would probably also make it work in browsers other than Chrome.
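
Roughly, my current processFrame looks something like this (a simplified sketch; this.blurWorker is my own Worker instance, and the names are mine, not the SDK's):

processFrame(inputFrame) {
  return new Promise(resolve => {
    // inputFrame is the locked OffscreenCanvas the SDK passes in
    const bitmap = inputFrame.transferToImageBitmap();

    this.blurWorker.onmessage = ({ data: processedBitmap }) => {
      // draw the processed bitmap into an OffscreenCanvas for the SDK to pick up
      const output = new OffscreenCanvas(processedBitmap.width, processedBitmap.height);
      output.getContext('2d').drawImage(processedBitmap, 0, 0);
      processedBitmap.close();
      resolve(output);
    };

    // ImageBitmap is transferable, so this avoids copying pixel data
    this.blurWorker.postMessage(bitmap, [bitmap]);
  });
}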

Suggestions for the API

  • It seems there is no reason (at least for now) to provide OffscreenCanvas in the API instead of HTMLCanvasElement.
  • I wonder if the API should provide a blank outputFrame: HTMLCanvasElement where the new image should be drawn. The API would provide it without locking it. Then the consumer could call .transferControlToOffscreen() and pass the connected OffscreenCanvas instance to the web worker (see the sketch after this list).
  • It would be great if the examples went further than simple CSS filters and actually integrated OffscreenCanvas / web workers / TensorFlow / etc.
  • It would also be great to have an explanation in the docs of why you went with OffscreenCanvas and what the longer-term roadmap for this API is.
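
For example, something along these lines (purely hypothetical, assuming the SDK handed over an unlocked outputFrame):

// hypothetical consumer code, assuming the SDK provides an unlocked outputFrame
const offscreen = outputFrame.transferControlToOffscreen();
worker.postMessage({ canvas: offscreen }, [offscreen]);
// from then on, the worker draws processed frames directly into the canvas
// the SDK captures from, so no bitmaps need to be posted back to the main thread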

Please let me know if I am missing something here and haven't understood how OffscreenCanvas should ideally be leveraged.

Dosant avatar Mar 09 '21 23:03 Dosant

Hi @Dosant ,

Thanks for trying out the VideoProcessor API. Regarding your question about OffscreenCanvas, this API is in pilot/alpha phase, so we are open to changing the API based on customer feedback. The reasons we went with OffscreenCanvas were:

  • We wanted to support Chrome only initially because most of our initial customer interest has been for Chrome. Also, we wanted to focus all our resources (engineering, QA) on getting this API working properly and performant on Chrome. We also did not want to field questions regarding performance issues and problems on other browsers at this point, since this API is in a very early stage.
  • We wanted to focus our initial support for this API on single-threaded applications. I understand you are trying to use web workers, but this alpha version satisfies a lot of the existing demand for video processing. We will definitely take your feedback into account going forward so that web workers are supported by this API before GA.

Thanks,

Manjesh

manjeshbhargav avatar Mar 10 '21 00:03 manjeshbhargav

Hi @Dosant ,

Regarding your statement:

In my case, I ended up calling inputFrame.transferToImageBitmap() and transferring the ImageBitmap into a web worker, where it was drawn into the worker's own OffscreenCanvas instance. Then I did the same to transfer the resulting ImageBitmap back.

This seems suboptimal, and I could do the same if the API provided an HTMLCanvasElement, which would probably also make it work in browsers other than Chrome.

Even if inputFrame were an HTMLCanvasElement, you would still not be able to transfer control to the web worker because the main thread (SDK) would have created a rendering context in order to paint it with the current input frame from the video track. So, you would still have to pass a bitmap to the web worker. I think passing the bitmap is better than passing the raw image data from getImageData(). So I think you are doing the right thing there.
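
For reference, the worker side of that bitmap hand-off could look roughly like this (just a sketch, not an official example):

// worker.js: rough sketch of receiving a frame bitmap, processing it,
// and handing the result back as a transferable ImageBitmap
const canvas = new OffscreenCanvas(1, 1);
const ctx = canvas.getContext('2d');

self.onmessage = ({ data: bitmap }) => {
  canvas.width = bitmap.width;
  canvas.height = bitmap.height;

  // draw the incoming frame, then run your processing against `canvas`
  ctx.drawImage(bitmap, 0, 0);
  bitmap.close();

  const processed = canvas.transferToImageBitmap();
  self.postMessage(processed, [processed]);
};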

Thank you for the detailed feedback. This is very useful for us in terms of calibrating the API in the near future.

Thanks,

Manjesh

manjeshbhargav avatar Mar 10 '21 01:03 manjeshbhargav

Even if inputFrame were an HTMLCanvasElement, you would still not be able to transfer control to the web worker because the main thread (SDK) would have created a rendering context in order to paint it with the current input frame from the video track. So, you would still have to pass a bitmap to the web worker. I think passing the bitmap is better than passing the raw image data from getImageData(). So I think you are doing the right thing there.

Right, I agree that I'd have to pass the inputFrame to a worker somehow anyway. The API could still have exposed an HTMLCanvasElement, and then the consumer code could draw it to its own OffscreenCanvas and pass it on as an ImageBitmap. In that case, consumer code could also fall back to a less performant main-thread version using the HTMLCanvasElement.

For the outputFrame, it would be very interesting to check whether the transferControlToOffscreen approach performs better than passing an ImageBitmap. Please note, I didn't compare performance.

We wanted to focus our initial support for this API on single-threaded applications. I understand you are trying to use web workers, but this alpha version satisfies a lot of the existing demand for video processing.

Just a note: it is working in a web worker 🥳 It just took some minor workarounds and figuring out how to make it work (there's no example).

Dosant avatar Mar 10 '21 17:03 Dosant

@Dosant ,

Thanks for the clarification. We will add a QuickStart example that demonstrates web workers soon.

Thanks,

Manjesh

manjeshbhargav avatar Mar 10 '21 17:03 manjeshbhargav

My suggestion for this API would be to simply provide access to the mediaStream object and let the implementer determine what to do with it.

// mediaStream -> processor -> mediaStream
function processor(mediaStream) {
  return mediaStream;
}

We are essentially doing this using a getUserMedia hack where we provide our own method for getUserMedia in the media track constraints object so that we can pipe the camera media stream track into tensorflow, do some image manipulation on canvas, and then return the canvas's media stream track:

createLocalVideoTrack({
  async getUserMedia(constraints) {
    const cameraMediaStream = await navigator.mediaDevices.getUserMedia(constraints);

    const canvasMediaStream = getCanvasFromTensorFlowManipulation(cameraMediaStream);

    return canvasMediaStream;
  }
});

markbrouch avatar Mar 12 '21 22:03 markbrouch

@manjeshbhargav Can we expect an example or documentation for what @Dosant has implemented using the VideoProcessor API?

I'm trying to implement the same with https://github.com/twilio/twilio-video-app-react/. I've also raised the question here: https://github.com/twilio/twilio-video-app-react/issues/453

Thanks.

SanjayBikhchandani avatar Mar 16 '21 06:03 SanjayBikhchandani

@SanjayBikhchandani ,

Our examples focus on demonstrating the use of the SDK APIs, so we typically tend to keep our examples simple so that developers don't have to read through a lot of code to get to the API usage. However, you can use the VideoProcessor APIs in conjunction with libraries such as bodyPix in order to achieve background substitution/replacement.
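
For reference, the overall wiring would look something like this sketch (the grayscale filter is only a stand-in; you would replace the body of processFrame with your own segmentation and compositing logic):

const { createLocalVideoTrack } = require('twilio-video');

// stand-in processor: swap the body of processFrame for your own
// segmentation/compositing logic (e.g. BodyPix)
class GrayscaleProcessor {
  processFrame(inputFrame) {
    const output = new OffscreenCanvas(inputFrame.width, inputFrame.height);
    const ctx = output.getContext('2d');
    ctx.filter = 'grayscale(100%)';
    ctx.drawImage(inputFrame, 0, 0);
    return output;
  }
}

// inside an async function
const track = await createLocalVideoTrack();
track.addProcessor(new GrayscaleProcessor());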

Thanks,

Manjesh

manjeshbhargav avatar Mar 16 '21 16:03 manjeshbhargav

@markbrouch ,

My suggestion for this API would be to simply provide access to the mediaStream object and let the implementer determine what to do with it.

// mediaStream -> processor -> mediaStream
function processor(mediaStream) {
  return mediaStream;
}

We are essentially doing this using a getUserMedia hack where we provide our own method for getUserMedia in the media track constraints object so that we can pipe the camera media stream track into tensorflow, do some image manipulation on canvas, and then return the canvas's media stream track:

createLocalVideoTrack({
  async getUserMedia(constraints) {
    const cameraMediaStream = await navigator.mediaDevices.getUserMedia(constraints);

    const canvasMediaStream = getCanvasFromTensorFlowManipulation(cameraMediaStream);

    return canvasMediaStream;
  }
});

We don't use MediaStreams in our SDK; we only operate on MediaStreamTracks. Since Tensorflow models operate on individual frames, you can use the existing processFrame(inputCanvas) method to pass the contents of the canvas to the Tensorflow model. The approach you have suggested will not work for us since we need to update the LocalVideoTrack's attached elements with the processed feed and also update the corresponding MediaStreamTracks being published to the Room.

Thanks,

Manjesh

manjeshbhargav avatar Mar 18 '21 23:03 manjeshbhargav

We don't use MediaStreams in our SDK; we only operate on MediaStreamTracks. Since Tensorflow models operate on individual frames, you can use the existing processFrame(inputCanvas) method to pass the contents of the canvas to the Tensorflow model. The approach you have suggested will not work for us since we need to update the LocalVideoTrack's attached elements with the processed feed and also update the corresponding MediaStreamTracks being published to the Room.

Thanks @manjeshbhargav,

Substitute mediaStreamTrack for mediaStream in my example and the main point remains. I think the main problem with processFrame as it currently exists is that it makes use of OffscreenCanvas, which currently has poor browser support. By being less prescriptive with the processFrame API and allowing the application to directly handle the mediaStreamTrack, we wouldn't have that restriction. In our solution we are piping the mediaStreamTrack through TF and performing our own canvas transformations using a normal canvas, which allows us to support Safari and Firefox in addition to Chrome.

markbrouch avatar Mar 19 '21 21:03 markbrouch

@markbrouch ,

Right now, we are limiting our support to Chrome because we are in the pilot/beta phase and we need to fine-tune our implementation to make it more performant. We do intend to support all browsers by the time we go to GA (sometime in Q2). The reason why we designed the VideoProcessor API this way is to allow the developers to focus only on implementing the logic to process frames and not have to worry about updating the preview elements and the published track (the SDK does all that for you). Also, if you want to pipe your own MediaStreamTrack, you can achieve that easily without the VideoProcessor API like so:

const { LocalVideoTrack } = require('twilio-video');

const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const videoTrack = stream.getTracks()[0];
const processedVideoTrack = processVideoTrack(videoTrack);
const twilioVideoTrack = new LocalVideoTrack(processedVideoTrack);
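
Here, processVideoTrack stands for whatever pipeline you want to build. A canvas-based version could look roughly like this (a sketch only, not part of the SDK):

function processVideoTrack(videoTrack) {
  // render the camera track through a hidden <video> element into a canvas
  const video = document.createElement('video');
  video.srcObject = new MediaStream([videoTrack]);
  video.play();

  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d');

  const draw = () => {
    if (video.videoWidth) {
      canvas.width = video.videoWidth;
      canvas.height = video.videoHeight;
      ctx.drawImage(video, 0, 0); // apply your own processing here
    }
    requestAnimationFrame(draw);
  };
  draw();

  // the canvas's own MediaStreamTrack carries the processed feed
  return canvas.captureStream(30).getVideoTracks()[0];
}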

Thanks,

Manjesh

manjeshbhargav avatar Mar 23 '21 17:03 manjeshbhargav

Thanks @manjeshbhargav , I'm excited to use this feature when it gains broader browser support!

markbrouch avatar Mar 24 '21 17:03 markbrouch

Just to add to the thread: I implemented something similar to @Dosant using BodyPix:

  • in the video processor's processFrame, I converted the frame to an ImageBitmap
  • sent it to a worker for segmentation
  • here, I think we differ a little: the worker sends prediction data back to the main thread (not image data)
  • the main thread composes the output frame

The issue is that I couldn't get more than 10-12 FPS on average (@Dosant, were you able to get anything better?) using an average machine, whether using a worker or not.

This is a great repo I stumbled upon while doing some research. First of all, @w-okada "worker-ized" a lot of common video processing libraries 👏 👏 👏, including BodyPix. Running his demo for BodyPix I get slightly better FPS (I don't understand why; I need to dive into it), but more importantly he also provides a worker for the Google Meet TFLite model, which is much faster (20-25 FPS on the same machine) and more precise. I haven't tried it out with Twilio Video yet.

@manjeshbhargav, as to the processor API, my 2 cents:

  • I don't mind receiving the input as a canvas, but obviously OffscreenCanvas will not work in Safari (where most of my customers are). Looking forward to an improvement here.
  • I do like the API being frame-based and not stream-based. In any case, I'd break the stream down frame by frame for processing.

shaibt avatar Apr 12 '21 13:04 shaibt

@shaibt Thanks for introducing my repos. Yes, I have the Google Meet model, but the model is currently not under the APACHE-2.0 license.

please see https://github.com/tensorflow/tfjs/issues/4177

w-okada avatar Apr 13 '21 02:04 w-okada

I implemented something similar to @Dosant using BodyPix:

@shaibt Is it possible you could post a code snippet of your VideoProcessor that uses BodyPix please?

RyanDurkin avatar Apr 15 '21 14:04 RyanDurkin

Is there a code sample that uses BodyPix with VideoProcessor? I can't find any way to send the <video> element to BodyPix and set the output canvas as the localStream for the remote peer connection.

adityajoshee avatar May 10 '21 14:05 adityajoshee

Hi @adityajoshee ,

You can write the contents of the OffscreenCanvas input frame that you get in the processFrame() callback into an HTMLCanvasElement, and then pass that to BodyPix's segmentPerson() method. Let me know if it works for you.
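
Something along these lines (a rough sketch, not an official example; net here is a BodyPix model you load ahead of time with bodyPix.load()):

// rough sketch: copy the OffscreenCanvas input frame into regular canvases
// that BodyPix can consume, blur the background, then hand back an OffscreenCanvas
const inputCopy = document.createElement('canvas');
const blurred = document.createElement('canvas');

async function processFrame(inputFrame) {
  inputCopy.width = blurred.width = inputFrame.width;
  inputCopy.height = blurred.height = inputFrame.height;
  inputCopy.getContext('2d').drawImage(inputFrame, 0, 0);

  const segmentation = await net.segmentPerson(inputCopy);
  bodyPix.drawBokehEffect(blurred, inputCopy, segmentation, 9 /* background blur */, 3 /* edge blur */);

  const output = new OffscreenCanvas(blurred.width, blurred.height);
  output.getContext('2d').drawImage(blurred, 0, 0);
  return output;
}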

Thanks,

Manjesh

manjeshbhargav avatar May 11 '21 01:05 manjeshbhargav

Basically, I'm trying to add background blur using BodyPix and publish the result as a video track for the local participant in the joinRoom function, like this:

let localCanvas = document.getElementById('localCanvas');
let localStream = localCanvas.captureStream(10);
const track = new Twilio.Video.LocalVideoTrack(localStream.getVideoTracks()[0]);
console.log('....*......');
await room.localParticipant.publishTrack(track, {
  name: 'canvasStream',
  priority: 'low',
});

But I get

TypeError: track must be a LocalAudioTrack, LocalVideoTrack, LocalDataTrack, or MediaStreamTrack

adityajoshee avatar May 11 '21 14:05 adityajoshee

Update: the original APACHE-2.0 license for the Google Meet segmentation model was found. I would also like to mention that it is used by Jitsi, so perhaps you can take a look at their code. On @w-okada's example, I am able to achieve 100 FPS on desktop with the 256x256 model, a 256x256 processing size, and SIMD. The models themselves can be found here.

Here is an article on it by @w-okada.

kirawi avatar May 14 '21 18:05 kirawi

@manjeshbhargav, it would be very helpful if you could share how to use the VideoProcessor for background blur, with actual code using BodyPix or any other lib for that matter.

aditya-protonn avatar May 20 '21 04:05 aditya-protonn

For background blur, I wrote a React hook (not using the VideoProcessor API, for wider browser support). OffscreenCanvas has almost the same API, so it can be done in a similar way:

https://gist.github.com/acro5piano/6f16fa332416479b9edadccc71b4bc25

acro5piano avatar May 21 '21 07:05 acro5piano

Blur and virtual backgrounds are now officially supported. Please see this announcement, which should answer most of the questions in this ticket.

charliesantos avatar Jun 25 '21 14:06 charliesantos

@charliesantos I tried out the new features. I appreciate the effort in adding blur and background removal as part of the twilio-video packaged solution, but I'm not sure these are usable in real commercial apps.

This is not a criticism of Twilio at all; it's relevant for any TFLite implementation with the popular models. IMHO, even with WASM SIMD, performance isn't good enough:

  • macOS Big Sur + 2.2GHz quad-core i7 + 16GB RAM (not truly a high-end machine, but probably a much stronger machine than any of our customers would use)
  • Chrome 91.0.4472.114
  • Video at 720p (Twilio's recommendation of 640x480 is not realistic for the real world)

Video FPS without blur = 25. Video FPS with blur = 10... too low.

PS: I took a short look at the code and was surprised to see that the segmentation is not done in a background web worker... perhaps that would help a little with the FPS?

shaibt avatar Jun 29 '21 13:06 shaibt

Hey @shaibt, thank you for trying out the new features! Sorry to hear you are not getting high enough FPS. But with your configuration, you should be able to push it up to 60FPS. I have a few questions:

  • Are you able to see the same performance in our live demo?
  • How are you measuring the frame rate? Any specific tool you're using? Are you measuring timings?
  • Do you have a test app deployed that I can try out?
  • Do you have room sids I can inspect?

Regarding web workers, that's something we're considering for the future.

charliesantos avatar Jun 29 '21 14:06 charliesantos

Hey @charliesantos,

I ran several more tests, and for all but one, I'm still getting a reduced frame rate when the blur processor is applied to the local video stream:

  • As a test app, I am using a React app very similar to the twilio-video React reference app, but only allowing a single participant to connect to an ad-hoc group room.
  • It's hard for me to tell whether the live demo suffers from the same performance issue, as it simply renders the local video with the processor without any RTP encoding or transmission. From a "naked eye test", it does look like FPS is reduced slightly.
  • I am measuring FPS on the outbound RTP using chrome://webrtc-internals (see the screenshots attached below); you can clearly see the "hit" on FPS once I enable the blur processor a few minutes into the session. The FPS goes from a steady 25 fps to an unstable 10-18 fps.
  • Here are two room SIDs that correlate with the performance data screenshots: RM54d863a3ea2711e5510469d36d7b49a1, RM934b1737324b20e9b4ab57a91af602ff.

I had one occurrence where FPS was not affected at all by the processor, but I could not recreate it or understand what was different about the setup; it seemed random.

Screen Shot 2021-07-01 at 18 04 15
Screen Shot 2021-07-01 at 17 58 08

shaibt avatar Jul 01 '21 15:07 shaibt

@shaibt, thanks for providing more details. Looking at your rooms, it seems the bottleneck in your configuration is the image resizing part. It's taking about 30ms to 55ms, which is bringing down the frame rate. This usually happens if you don't have enough graphics processing power, and a bigger capture resolution will also make this worse. Just curious, what are your GPU specs?

PS: Your CPU is powerful enough that segmentation only takes about 5ms on average, based on the data I captured from your room.

charliesantos avatar Jul 01 '21 16:07 charliesantos

@charliesantos it's the default built-in GPU of a 17" MacBook. I'm running the GPU history graph (macOS Activity Monitor) in parallel with the blur processor, and it doesn't look like it's exerting serious effort. Can you check this room to see if there's an improvement (it looks good on my side): RM0fb62d2190149a8b6aab3e89bf694e33

shaibt avatar Jul 01 '21 16:07 shaibt

Hey @shaibt, I'm not seeing anything for that room. The WS must have been disconnected. You can check the video-processor stats yourself in the Chrome debugger, under the Network tab, WS. Look for the "stats" event. See the screenshot below.

Screen Shot 2021-07-01 at 9 45 22 AM

charliesantos avatar Jul 01 '21 16:07 charliesantos

Hi @charliesantos, thanks for the info re the video processor stats. So far, over the last 24 hours everything looks fine: GPU usage, video FPS, and processing delay stats. I still don't understand why we're experiencing inconsistency from test to test. I'll report back if we find out anything.

shaibt avatar Jul 02 '21 10:07 shaibt

I'd like to bump this thread to get a better understanding of where support for video processors in the twilio-video SDK currently stands.

My case is pretty simple. I am attempting to draw a timestamp at the bottom of a user's video. Is it true that twilio-video exclusively uses OffscreenCanvas to do this work?
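
Roughly, what I'm after is something like this (just a sketch, assuming the OffscreenCanvas-based processFrame signature):

// sketch: stamp the current time onto each frame
class TimestampProcessor {
  constructor() {
    this.output = new OffscreenCanvas(1, 1);
  }

  processFrame(inputFrame) {
    this.output.width = inputFrame.width;
    this.output.height = inputFrame.height;
    const ctx = this.output.getContext('2d');
    ctx.drawImage(inputFrame, 0, 0);
    ctx.font = '24px sans-serif';
    ctx.fillStyle = 'white';
    ctx.fillText(new Date().toLocaleTimeString(), 10, this.output.height - 12);
    return this.output;
  }
}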

There is currently some work being done on OffscreenCanvas in Safari Technology Preview, but I don't think the feature has been activated yet.

BrandonMathis avatar Jul 11 '22 20:07 BrandonMathis

Hi @BrandonMathis, we are currently investigating what it would take to support this on other browsers. Most likely, we will use HTMLCanvasElement on browsers that don't support OffscreenCanvas.
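
In practice, that would amount to a simple feature check along these lines (illustrative only; createFrameBuffer is a hypothetical helper, not an SDK API):

// illustrative only: pick OffscreenCanvas when available, fall back to a regular canvas
function createFrameBuffer(width, height) {
  if (typeof OffscreenCanvas !== 'undefined') {
    return new OffscreenCanvas(width, height);
  }
  const canvas = document.createElement('canvas');
  canvas.width = width;
  canvas.height = height;
  return canvas;
}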

charliesantos avatar Jul 11 '22 21:07 charliesantos