
Human Pose Viewer - Faster, Prettier, Generic Implementation

Open AmitMY opened this issue 3 years ago • 0 comments

Problem

The current human pose viewer looks human-ish and runs in high definition (768x768); however, it is not pleasing to the eye and is slow (~70ms per frame with WebGPU on an RTX 3070 Ti), so we need to devise better models that use fewer resources.

https://user-images.githubusercontent.com/5757359/195432506-9834af4c-6ddc-4fff-a9f6-4dcbea1abb2b.mp4

Description

We currently use a GAN to train a U-NET from poses to people (https://github.com/sign-language-processing/everybody-sign-now). It is 2D and works frame-by-frame, with an LSTM sharing context state at the bottleneck of the U-NET.

Instead of our current U-NET, which weighs around 100MB (float16) and performs many operations, we could use smaller, less accurate architectures, like SqueezeNet. Since these would be faster but less accurate, we could train multiple networks, similar to a diffusion process, to iteratively improve the output quality given context. Then, at inference time, since we strive for real-time translation, we could perform as many iterations as time allows while maintaining the target frame rate, based on heuristics.
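The diffusion-like idea above can be sketched with a toy stand-in: each "pass" improves the current estimate, so running more passes (when the frame budget allows) yields higher quality. In the real system each step would be a small learned network conditioned on the pose and context, not this closed-form update; the function name and update rule here are purely illustrative.

```python
import numpy as np

def iterative_refine(coarse, target, steps, alpha=0.5):
    """Toy stand-in for a cascade of small refinement networks.

    Each pass moves the estimate a fraction `alpha` of the way toward the
    clean image, so error shrinks monotonically with every extra iteration,
    mimicking how extra network passes would improve quality given context.
    """
    x = coarse
    errors = []
    for _ in range(steps):
        x = x + alpha * (target - x)  # one refinement pass
        errors.append(float(np.abs(target - x).mean()))
    return x, errors
```

With `alpha=0.5`, five passes cut the error to about 3% of its starting value, which is the property the frame-rate heuristic would exploit: stop whenever the next pass no longer fits in the budget.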

Finally, there needs to be an upscaling model. 768x768 might be unnecessary, and 512x512 may be enough. This also means the original U-NETs don't need to operate at 256x256. We might even optimize 64x64 latent space tensors and let the "upscaling" model turn them into a nice video, or generate the face, body, and hands independently in low resolution (64x64), stitch them on top of each other, and "upscale" to fix the colors and imperfections.
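The stitch-then-upscale variant could look roughly like this minimal numpy sketch. `composite_parts` and `nn_upscale` are hypothetical names, and nearest-neighbor upscaling stands in for the learned "upscale & fix imperfections" model:

```python
import numpy as np

def composite_parts(canvas_hw, parts):
    """Paste independently generated low-res crops (face, body, hands)
    onto a shared canvas; later parts layer on top of earlier ones."""
    h, w = canvas_hw
    canvas = np.zeros((h, w, 3), dtype=np.float32)
    for patch, (top, left) in parts:
        ph, pw = patch.shape[:2]
        canvas[top:top + ph, left:left + pw] = patch
    return canvas

def nn_upscale(img, k):
    """Nearest-neighbor upscale as a placeholder for the learned
    upscaling model that would also fix color and seam artifacts."""
    return np.repeat(np.repeat(img, k, axis=0), k, axis=1)
```

For example, a 64x64 composite upscaled by 8x gives the proposed 512x512 output; the learned model would additionally blend the seams between parts.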

What's clear is that there needs to be:

  • [ ] A complex training pipeline, to train all these models on real and predicted data
  • [ ] A complex inference pipeline, to estimate how much inference a given machine can perform and strike a quality/speed balance
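The inference-pipeline heuristic could start as simple as timing the refinement step on the target device at startup and deriving how many passes fit into one frame. This is a hypothetical sketch (a real pipeline would likely also adapt at runtime):

```python
import time

def calibrate_iterations(step, target_fps=30.0, warmup=2, trials=5):
    """Estimate how many refinement passes fit in one frame at target_fps.

    `step` is a zero-argument callable that runs one model pass; we warm it
    up (to avoid counting one-time setup), time a few trials, and return the
    number of passes that fit in the per-frame budget, with a minimum of 1.
    """
    for _ in range(warmup):
        step()
    t0 = time.perf_counter()
    for _ in range(trials):
        step()
    per_step = (time.perf_counter() - t0) / trials
    budget = 1.0 / target_fps
    return max(1, int(budget // per_step))
```

On a fast device this returns many passes (better quality); on a slow one it degrades gracefully to a single pass rather than dropping below real time.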

Alternatives

Since we strive to work on mobile devices, we could ignore the web platform and focus only on optimizations for specific silicon. (https://github.com/sign/translate/issues/25)

Another optimization route is using batches to speed up inference. On WebGL, batches don't seem to matter much, but on WebGPU they yield a 5-10x speed improvement, depending on batch size and other factors. We need to "learn" how large a batch a given device can handle while keeping real-time performance, and how to buffer many frames as quickly as possible.
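The buffering side of that could be a small scheduler that accumulates frames and flushes a batch either when it reaches the size the device was measured to handle well, or when the oldest frame has waited too long, so batching never breaks real-time latency. This is an illustrative sketch, not the app's actual scheduler:

```python
import time

class FrameBatcher:
    """Buffer incoming frames and emit batches for batched GPU inference.

    Flushes when the batch reaches `max_batch` (the size we "learned" the
    device handles well) or when the oldest buffered frame has waited more
    than `max_wait_s`, bounding the latency added by batching.
    """

    def __init__(self, max_batch=8, max_wait_s=0.05):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._frames = []
        self._oldest = None

    def push(self, frame, now=None):
        """Add a frame; return a batch (list of frames) if one is due, else None."""
        now = time.perf_counter() if now is None else now
        if not self._frames:
            self._oldest = now
        self._frames.append(frame)
        if len(self._frames) >= self.max_batch or (now - self._oldest) >= self.max_wait_s:
            batch, self._frames = self._frames, []
            return batch
        return None
```

With `max_batch=8` and a 50ms wait cap, a fast camera fills full batches for the 5-10x WebGPU speedup, while a slow camera still gets its frames out within one frame-time of arriving.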

AmitMY avatar Oct 12 '22 19:10 AmitMY