twilio-video.js
[VideoProcessor API] OffscreenCanvas limitation and API suggestions?
Thanks for adding the VideoProcessor API, very handy and convenient to have. I added blurring/replacing the video background in my app through this API and everything seems to be working just fine 👍 tweet for the curious
OffscreenCanvas limitation?
Why does the API expose `OffscreenCanvas` instead of `HTMLCanvasElement`? I didn't find an explanation of this design decision in the docs. Also, the examples you've implemented don't seem to be doing anything with `OffscreenCanvas` that isn't possible with `HTMLCanvasElement`. I understand this is the reason why this API only works in Chrome, but it's not clear why the API is limited by using `OffscreenCanvas`. Would love to understand the bigger picture 🙏
My implementation on top of the current API
As I understand it, the `OffscreenCanvas` that the API provides was meant to be transferred into a web worker to offload the work to a separate thread. But the API provides an already locked `OffscreenCanvas`, so it isn't possible to transfer it as is.
In my case, I ended up calling `inputFrame.transferToImageBitmap()` and transferring the `ImageBitmap` into a web worker, where it was drawn into the worker's own `OffscreenCanvas` instance. Then I did the same to transfer the resulting `ImageBitmap` back.
This seems suboptimal, and I could do the same if the API provided an `HTMLCanvasElement`, which would probably also make it work in browsers other than Chrome.
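A sketch of that workaround, as I understand the flow described above (the worker file name and message shape are illustrative, not part of the SDK):

```javascript
// --- main thread: inside the VideoProcessor callback ---
// `inputFrame` is the locked OffscreenCanvas provided by the SDK.
const worker = new Worker('segmentation-worker.js'); // hypothetical worker file

function processFrame(inputFrame) {
  // Detach the current frame as an ImageBitmap so it can be transferred
  // to the worker without copying pixel data (note the transfer list).
  const bitmap = inputFrame.transferToImageBitmap();
  worker.postMessage({ frame: bitmap }, [bitmap]);
}

// --- segmentation-worker.js (separate file) ---
// Draw the transferred bitmap into the worker's own OffscreenCanvas,
// process it there, and transfer the resulting ImageBitmap back.
self.onmessage = ({ data }) => {
  const canvas = new OffscreenCanvas(data.frame.width, data.frame.height);
  const ctx = canvas.getContext('2d');
  ctx.drawImage(data.frame, 0, 0);
  // ...run segmentation / blur / compositing on `canvas` here...
  const result = canvas.transferToImageBitmap();
  self.postMessage({ frame: result }, [result]);
};
```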
Suggestions to the API
- It seems there is no reason (at least for now) to provide `OffscreenCanvas` in the API instead of `HTMLCanvasElement`.
- I wonder if the API should provide a blank `outputFrame: HTMLCanvasElement` where the new image should be drawn. The API would provide it without locking it. The consumer could then call `.transferControlToOffscreen()` and pass that connected `OffscreenCanvas` instance to the web worker.
- It would be great if examples went further than simple CSS filters and actually integrated `OffscreenCanvas` / web workers / TensorFlow / etc.
- It would be great to have an explanation in the docs of why you went with `OffscreenCanvas`, and what the longer-term roadmap for this API is.
Please let me know if I am missing something here and didn't get how `OffscreenCanvas` should ideally be leveraged.
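To illustrate the second suggestion: if `outputFrame` were an unlocked `HTMLCanvasElement`, the consumer could move all output rendering into a worker once, up front. This is a hypothetical API shape, not the current SDK:

```javascript
// Hypothetical: the SDK hands the processor a blank, unlocked
// HTMLCanvasElement as outputFrame.
const worker = new Worker('render-worker.js'); // illustrative worker file

function setupProcessor(outputFrame) {
  // transferControlToOffscreen() detaches rendering from the element;
  // drawing then happens in the worker, and the element on the main
  // thread updates automatically, without per-frame postMessage traffic.
  const offscreen = outputFrame.transferControlToOffscreen();
  worker.postMessage({ canvas: offscreen }, [offscreen]);
}
```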
Hi @Dosant ,
Thanks for trying out the VideoProcessor API. Regarding your question about OffscreenCanvas, this API is in pilot/alpha phase, so we are open to changing the API based on customer feedback. The reasons we went with OffscreenCanvas were:
- We wanted to support Chrome only initially because most of our initial customer interest has been for Chrome. Also, we wanted to focus all our resources (engineering, QA) on getting this API working properly and performant on Chrome. We also did not want to field questions regarding performance issues and problems on other browsers at this point, since this API is in a very early stage.
- We wanted to focus our initial support for this API on single-threaded applications. I understand you are trying to use web workers, but this alpha version satisfies a lot of the existing demand for video processing. We will definitely take your feedback into account going forward so that web workers are supported by this API before GA.
Thanks,
Manjesh
Hi @Dosant ,
Regarding your statement:
> In my case, I ended up calling `inputFrame.transferToImageBitmap()` and transferring an `ImageBitmap` into a web-worker that was then drawn into the worker's own `OffscreenCanvas` instance. Then I did the same for transferring the resulting `ImageBitmap` back.
>
> This seems to be suboptimal, and I could do the same if the API would provide `HTMLCanvasElement` and probably would make it work in other browsers than chrome.
Even if `inputFrame` was an `HTMLCanvasElement`, you would still not be able to transfer control to the web worker because the main thread (SDK) would have created a rendering context in order to paint it with the current input frame from the video track. So, you would still have to pass a bitmap to the web worker. I think passing the bitmap is better than passing the raw image data from `getImageData()`. So I think you are doing the right thing there.
Thank you for the detailed feedback. This is very useful for us in terms of calibrating the API in the near future.
Thanks,
Manjesh
> Even if `inputFrame` was an `HTMLCanvasElement`, you would still not be able to transfer control to the web worker because the main thread (SDK) would have created a rendering context in order to paint it with the current input frame from the video track. So, you would still have to pass a bitmap to the web worker. I think passing the bitmap is better than passing the raw image data from `getImageData()`. So I think you are doing the right thing there.
Right, I agree that I'd have to pass the `inputFrame` to a worker somehow anyway.
The API still could have exposed `HTMLCanvasElement`; the consumer code could then draw it to its own `OffscreenCanvas` and pass it as an `ImageBitmap`. In this case, consumer code could also fall back to a less performant main-thread version using `HTMLCanvasElement`.
For the `outputFrame`, it would be very interesting to check whether the `transferControlToOffscreen` approach would perform better than passing an `ImageBitmap`. Please note, I didn't compare performance.
> We wanted to focus our initial support for this API on single-threaded applications. I understand you are trying to use web workers, but this alpha version satisfies a lot of the existing demand for video processing.
Just a note: it is working in a web worker 🥳 Just some minor workarounds and figuring out how to make it work (no example).
@Dosant ,
Thanks for the clarification. We will add a QuickStart example that demonstrates web workers soon.
Thanks,
Manjesh
My suggestion for this API would be to simply provide access to the `mediaStream` object and let the implementer determine what to do with it.

```js
// mediaStream -> processor -> mediaStream
function processor(mediaStream) {
  return mediaStream;
}
```
We are essentially doing this using a `getUserMedia` hack where we provide our own method for `getUserMedia` in the media track constraints object so that we can pipe the camera media stream track into TensorFlow, do some image manipulation on canvas, and then return the canvas's media stream track:

```js
createLocalVideoTrack({
  async getUserMedia(constraints) {
    const cameraMediaStream = await navigator.mediaDevices.getUserMedia(constraints);
    const canvasMediaStream = getCanvasFromTensorFlowManipulation(cameraMediaStream);
    return canvasMediaStream;
  }
});
```
@manjeshbhargav Can we expect an example or documentation for what @Dosant has implemented using the VideoProcessor API?
I'm trying to implement the same with https://github.com/twilio/twilio-video-app-react/ and I've also raised the question here: https://github.com/twilio/twilio-video-app-react/issues/453
Thanks.
@SanjayBikhchandani ,
Our examples focus on demonstrating the use of the SDK APIs, so we typically tend to keep our examples simple so that developers don't have to read through a lot of code to get to the API usage. However, you can use the VideoProcessor APIs in conjunction with libraries such as bodyPix in order to achieve background substitution/replacement.
Thanks,
Manjesh
@markbrouch ,
> My suggestion for this API would be to simply provide access to the `mediaStream` object and let the implementer determine what to do with it.
>
> ```js
> // mediaStream -> processor -> mediaStream
> function processor(mediaStream) {
>   return mediaStream;
> }
> ```
>
> We are essentially doing this using a `getUserMedia` hack where we provide our own method for `getUserMedia` in the media track constraints object so that we can pipe the camera media stream track into tensorflow, do some image manipulation on canvas, and then return the canvas's media stream track:
>
> ```js
> createLocalVideoTrack({
>   async getUserMedia(constraints) {
>     const cameraMediaStream = await navigator.mediaDevices.getUserMedia(constraints);
>     const canvasMediaStream = getCanvasFromTensorFlowManipulation(cameraMediaStream);
>     return canvasMediaStream;
>   }
> });
> ```

We don't use MediaStreams in our SDK, we only operate on MediaStreamTracks. Since TensorFlow models operate on individual frames, you can use the existing `processFrame(inputCanvas)` method to pass the contents of the canvas to the TensorFlow model. The approach you have suggested will not work for us since we need to update the LocalVideoTrack's attached elements with the processed feed and also update the corresponding MediaStreamTracks being published to the Room.
Thanks,
Manjesh
Thanks @manjeshbhargav,
Substitute `mediaStreamTrack` for `mediaStream` in my example and the main point remains. I think the main problem with `processFrame` as it currently exists is that it makes use of `OffscreenCanvas`, which currently has poor browser support. By being less prescriptive with the `processFrame` API and allowing the application to directly handle the `mediaStreamTrack`, we wouldn't have that restriction. In our solution we are piping the `mediaStreamTrack` through TF and performing our own canvas transformations using a normal `canvas`, which allows us to support Safari and Firefox in addition to Chrome.
@markbrouch ,
Right now, we are limiting our support to Chrome because we are in the pilot/beta phase and we need to fine-tune our implementation to make it more performant. We do intend to support all browsers by the time we go to GA (sometime in Q2). The reason why we designed the VideoProcessor API this way is to allow the developers to focus only on implementing the logic to process frames and not have to worry about updating the preview elements and the published track (the SDK does all that for you). Also, if you want to pipe your own MediaStreamTrack, you can achieve that easily without the VideoProcessor API like so:
```js
const { LocalVideoTrack } = require('twilio-video');

const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const videoTrack = stream.getTracks()[0];
const processedVideoTrack = processVideoTrack(videoTrack);
const twilioVideoTrack = new LocalVideoTrack(processedVideoTrack);
```
Thanks,
Manjesh
Thanks @manjeshbhargav , I'm excited to use this feature when it gains broader browser support!
Just to add to the thread: I implemented something similar to @Dosant using BodyPix:
- in the video processor, `processFrame` converted the frame to an `ImageBitmap` and sent it to a worker for segmentation
- here, I think we differ a little: the worker sends prediction data back to the main thread (not image data)
- the main thread composes the output frame
The issue is that I couldn't get more than 10-12 FPS on average (@Dosant were you able to get anything better?) on an average machine (whether using a worker or not).
This is a great repo I stumbled upon doing some research. First of all, @w-okada "worker-ized" a lot of common video processing libraries 👏 👏 👏 including BodyPix. Running his demo for BodyPix I get slightly better FPS (I don't understand why; need to dive into it), but more importantly he also provides a worker for the Google Meet TFLite model, which is much faster (20-25 FPS on the same machine) and more precise. I haven't tried it out with Twilio Video yet.
@manjeshbhargav, as to the processor API, my 2 cents:
- I don't mind receiving the input as a canvas, but obviously `OffscreenCanvas` will not work in Safari (where most of my customers are). Looking forward to an improvement here.
- I do like the API being frame-based and not stream-based. In any case I'd break the stream down frame-by-frame for processing.
@shaibt Thanks for introducing my repos. Yes, I have the Google Meet model, but the model is currently not under the APACHE-2.0 license.
Please see https://github.com/tensorflow/tfjs/issues/4177
> I implemented something similar to @Dosant using BodyPix:
@shaibt Is it possible you could post a code snippet of your VideoProcessor that uses BodyPix please?
Is there a code sample which uses BodyPix with VideoProcessor? I can't find any way to send the `<video>` element to BodyPix and set the output canvas as the `localStream` for the `remotePeerConnection`.
Hi @adityajoshee ,
You can write the contents of the `OffscreenCanvas` input frame that you get in the `processFrame()` callback into an `HTMLCanvasElement`, and then pass it to BodyPix's `segmentPerson()` method. Let me know if it works for you.
Thanks,
Manjesh
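A minimal sketch of that advice (not official Twilio code; it assumes `@tensorflow-models/body-pix` is loaded as `bodyPix`, and that the processor returns the canvas to render, matching the API shape discussed in this thread):

```javascript
// Sketch: copy the SDK's OffscreenCanvas input frame into plain
// HTMLCanvasElements that BodyPix can read from and draw to.
class BackgroundBlurProcessor {
  constructor(net) {
    this.net = net; // a BodyPix model from bodyPix.load()
    this.input = document.createElement('canvas');
    this.output = document.createElement('canvas');
  }

  async processFrame(inputFrame) {
    this.input.width = this.output.width = inputFrame.width;
    this.input.height = this.output.height = inputFrame.height;
    this.input.getContext('2d').drawImage(inputFrame, 0, 0);

    const segmentation = await this.net.segmentPerson(this.input);
    // drawBokehEffect composites a blurred background behind the person.
    bodyPix.drawBokehEffect(this.output, this.input, segmentation, 9, 7);
    return this.output;
  }
}

// Usage (hypothetical):
// const net = await bodyPix.load();
// videoTrack.addProcessor(new BackgroundBlurProcessor(net));
```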
Basically I'm trying to add background blur using BodyPix and add that as a video track to local participant in the joinRoom function, like this -
```js
let localCanvas = document.getElementById('localCanvas');
let localStream = localCanvas.captureStream(10);
const track = new Twilio.Video.LocalVideoTrack(localStream.getVideoTracks()[0]);
console.log('....*......');
await room.localParticipant.publishTrack(track, {
  name: 'canvasStream',
  priority: 'low',
});
```

But I get:

```
TypeError: track must be a LocalAudioTrack, LocalVideoTrack, LocalDataTrack, or MediaStreamTrack
```
Update: The original APACHE-2.0 license for the Google Meet segmentation model was found. I would also like to mention that it is used by Jitsi, so perhaps you can take a look at their code. On @w-okada's example, I am able to achieve 100 FPS on desktop with the 256x256 model, 256x256 process size, and SIMD. The models themselves can be found here.
@manjeshbhargav it would be very helpful if you could share how to use the VideoProcessor for background blur, with actual code using BodyPix or any other lib for that matter.
For background blur, I wrote a React hook (not using the VideoProcessor API, for wider browser support). `OffscreenCanvas` has almost the same API, so we can do it in a similar way:
https://gist.github.com/acro5piano/6f16fa332416479b9edadccc71b4bc25
Blur and virtual background is now officially supported. Please see this announcement which should answer most of the questions in this ticket.
@charliesantos Tried out the new features. I appreciate the effort in adding blur and background removal as part of the `twilio-video` packaged solution, but I'm not sure these are usable in real commercial apps.
This is not a criticism of Twilio at all! It's relevant for any TFLite implementation with the popular models. IMHO, even with WASM SIMD, performance isn't good enough:
- OS X Big Sur + 2.2GHz quad-core i7 + 16GB RAM (not truly a high-end machine, but probably much stronger than anything our customers would use)
- Chrome 91.0.4472.114
- Video at 720p (the Twilio recommendation of 640x480 is not realistic for the real world)
Video FPS without blur = 25. Video FPS with blur = 10... too low.
PS: I took a short look into the code and was surprised to see that the segmentation is not done in a background web worker... perhaps that would help a little with the FPS?
Hey @shaibt, thank you for trying out the new features! Sorry to hear you are not getting high enough FPS. But with your configuration, you should be able to push it up to 60FPS. I have a few questions:
- Are you able to see the same performance in our live demo?
- How are you measuring the frame rate? Any specific tool you're using? Are you measuring timings?
- Do you have a test app deployed that I can try out?
- Do you have room sids I can inspect?
In regards to web worker, it's something that we're considering in the future.
Hey @charliesantos,
I ran several more tests, and in all but one I'm still getting a reduced frame rate when the blur processor is applied to the local video stream:
- As a test app, I am using a React app very similar to the twilio-video React reference app, but only allowing a single participant to connect to an ad-hoc group room.
- It's hard for me to tell if the live demo is suffering from the same performance issue, as it's simply rendering the local video with the processor, without any RTP encoding and transmission. From a "naked eye test" it does look like FPS is reduced slightly.
- I am measuring FPS on the outbound RTP using `chrome://webrtc-internals` (see screenshots attached below). You can clearly see the "hit" on FPS once I enable the blur processor a few minutes into the session: the FPS goes from a steady 25 FPS to an unstable 10-18 FPS.
- Here are 2 room SIDs that correlate to the performance data screenshots: `RM54d863a3ea2711e5510469d36d7b49a1`, `RM934b1737324b20e9b4ab57a91af602ff`.
I had one occurrence where FPS was not affected at all by the processor, but I could not recreate it nor understand what was different about the setup; it seemed random.
![Screen Shot 2021-07-01 at 18 04 15](https://user-images.githubusercontent.com/25963448/124152445-3393a700-da9c-11eb-992d-6362075168de.png)
![Screen Shot 2021-07-01 at 17 58 08](https://user-images.githubusercontent.com/25963448/124152458-368e9780-da9c-11eb-938e-8744a4cf1c74.png)
@shaibt , thanks for providing more details. Looking at your rooms, it seems the bottleneck in your configuration is on the image resizing part. It's taking about 30ms to 55ms which is bringing down the frame rate. This usually happens if you don't have enough graphics processing power. And a bigger capture resolution will also make this worse. Just curious, what are your GPU specs?
PS: Your CPU is powerful enough that segmentation only takes about 5ms on average, based on the data I captured from your room.
@charliesantos it's the default MacBook 17" built-in GPU.
I'm running the GPU history graph (OS X Activity Monitor) in parallel to the blur processor and it doesn't look like it's exerting serious effort. Can you check this room to see if there's an improvement (looks good on my side): `RM0fb62d2190149a8b6aab3e89bf694e33`
Hey @shaibt I'm not seeing anything for that room. The WS must have been disconnected. You can check the video-processor stats yourself by looking at the Chrome debugger, Network tab, WS. Look for the "stats" event. See the screenshot below.
Hi @charliesantos, thanks for the info re the video processor stats. So far, over the last 24 hours, everything looks fine: GPU usage, video FPS, and processing delay stats. I still don't understand why we're experiencing inconsistency from test to test. I'll report back if we find out anything.
Would like to bump this thread to get a better understanding of where support for video processors in the twilio-video SDK is currently at.
My case is pretty simple. I am attempting to draw a timestamp at the bottom of a user's video. Is it true that twilio-video exclusively uses OffscreenCanvas to do this work?
There is currently some work being done on OffscreenCanvas in Safari Technology Preview, but I don't think the feature has been activated yet.
Hi @BrandonMathis, we are currently investigating what it would take to support this on other browsers. Most likely, we will use HTMLCanvasElement on browsers that don't support OffscreenCanvas.