
VideoMAE Model Enabling and Examples

Open pi314ever opened this issue 9 months ago • 2 comments

What does this PR do?

This PR adds examples and demonstrates that VideoMAE runs on Gaudi 2 in graph mode with the model cast to BF16. The included tests check compatibility with both settings and add a latency regression test for the graph mode + BF16 configuration.

No core code changes were made to enable the model.

Before submitting

  • [x] Did you make sure to update the documentation with your changes?
  • [x] Did you write any new necessary tests?

pi314ever, Apr 25 '24 17:04

@pi314ever Daniel, please list the performance benchmarks for Gaudi 2 versus A100.

yao-matrix, May 14 '24 00:05

Performance (s)   A100      Gaudi2
BF16              0.02548   0.01313
FP32              0.05736   0.01962
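
To make the comparison explicit, the ratio between the two columns can be computed directly. This is a minimal sketch that uses only the numbers reported in the table; the dictionary layout and the `speedup` name are illustrative and not part of the PR.

```python
# Average seconds per forward pass, copied from the table above.
latency_s = {
    "BF16": {"A100": 0.02548, "Gaudi2": 0.01313},
    "FP32": {"A100": 0.05736, "Gaudi2": 0.01962},
}

# Ratio of A100 time to Gaudi2 time for each precision (illustrative helper name).
speedup = {
    precision: times["A100"] / times["Gaudi2"]
    for precision, times in latency_s.items()
}

for precision, ratio in speedup.items():
    print(f"{precision}: A100 time / Gaudi2 time = {ratio:.2f}")
```

On these figures the ratio is roughly 1.94 for BF16 and 2.92 for FP32.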

Testing setup:

  • 100 sequential model passthroughs of a single video buffer of 16 frames
  • Recorded performance is average time per forward pass
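
The measurement described in the setup above can be sketched roughly as follows. This is not the script used for the numbers in this thread: `average_latency` and `dummy_forward` are hypothetical names, and the stand-in workload replaces the real model call. Whether the original measurement included warmup iterations is not stated, so none are assumed here.

```python
import time
from typing import Callable


def average_latency(forward: Callable[[], object], n_runs: int = 100) -> float:
    """Return the mean wall-clock seconds per call over n_runs sequential calls."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    start = time.perf_counter()
    for _ in range(n_runs):
        forward()
    return (time.perf_counter() - start) / n_runs


if __name__ == "__main__":
    # Stand-in workload (assumption): in practice this callable would wrap one
    # model forward pass on a single 16-frame video buffer.
    def dummy_forward() -> int:
        return sum(range(10_000))

    print(f"{average_latency(dummy_forward):.3e} s per pass")
```

On an accelerator that executes asynchronously, the wrapped callable would also need to wait for the device to finish before returning, otherwise the clock only measures how long it takes to queue the work.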

pi314ever, May 14 '24 22:05

can you rebase this?

mounikamandava, Jun 13 '24 21:06

Looks good to me

mounikamandava, Jun 18 '24 22:06

I added the patch. I ran the script with multiple video inputs from TempoFunk/webvid-10M:

python run_example.py -bg -w 3 \
    --video_paths https://ak.picdn.net/shutterstock/videos/5629184/preview/stock-footage-senior-couple-looking-through-binoculars-on-sailboat-together-shot-on-red-epic-for-high-quality-k.mp4 \
    https://ak.picdn.net/shutterstock/videos/21179416/preview/stock-footage-aerial-shot-winter-forest.mp4 \
    https://ak.picdn.net/shutterstock/videos/1063125190/preview/stock-footage-a-beautiful-cookie-with-oranges-lies-on-a-green-tablecloth.mp4 \
    https://ak.picdn.net/shutterstock/videos/1039695998/preview/stock-footage-japanese-highrise-office-skyscrapers-tokyo-square.mp4 \
    https://ak.picdn.net/shutterstock/videos/9607838/preview/stock-footage-zrenjanin-serbia-march-fans-watching-live-concert-bokeh-blur-urban-background-x.mp4 

This gave the following output:

Predicted class for stock-footage-senior-couple-looking-through-binoculars-on-sailboat-together-shot-on-red-epic-for-high-quality-k.mp4 is sailing and took 3.372e-01 seconds
Predicted class for stock-footage-aerial-shot-winter-forest.mp4 is sled dog racing and took 3.360e-01 seconds
Predicted class for stock-footage-a-beautiful-cookie-with-oranges-lies-on-a-green-tablecloth.mp4 is cooking sausages and took 3.349e-01 seconds
Predicted class for stock-footage-japanese-highrise-office-skyscrapers-tokyo-square.mp4 is marching and took 3.362e-01 seconds
Predicted class for stock-footage-zrenjanin-serbia-march-fans-watching-live-concert-bokeh-blur-urban-background-x.mp4 is slacklining and took 3.358e-01 seconds
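
For a quick summary of a run like this one, the log lines can be parsed and averaged. This is a minimal sketch that assumes the line format shown above; the regular expression and the `parse_line` helper are illustrative and not part of `run_example.py`.

```python
import re
from statistics import mean

# Output lines copied verbatim from the run above.
LOG = """\
Predicted class for stock-footage-senior-couple-looking-through-binoculars-on-sailboat-together-shot-on-red-epic-for-high-quality-k.mp4 is sailing and took 3.372e-01 seconds
Predicted class for stock-footage-aerial-shot-winter-forest.mp4 is sled dog racing and took 3.360e-01 seconds
Predicted class for stock-footage-a-beautiful-cookie-with-oranges-lies-on-a-green-tablecloth.mp4 is cooking sausages and took 3.349e-01 seconds
Predicted class for stock-footage-japanese-highrise-office-skyscrapers-tokyo-square.mp4 is marching and took 3.362e-01 seconds
Predicted class for stock-footage-zrenjanin-serbia-march-fans-watching-live-concert-bokeh-blur-urban-background-x.mp4 is slacklining and took 3.358e-01 seconds
"""

# Assumed line format: "Predicted class for <file> is <label> and took <t> seconds".
LINE_RE = re.compile(
    r"Predicted class for (?P<video>\S+) is (?P<label>.+) and took (?P<seconds>\S+) seconds"
)


def parse_line(line: str) -> tuple[str, str, float]:
    """Split one log line into (video file name, predicted label, seconds)."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return match["video"], match["label"], float(match["seconds"])


results = [parse_line(line) for line in LOG.splitlines() if line.strip()]
mean_s = mean(seconds for _, _, seconds in results)
print(f"{len(results)} videos, mean {mean_s:.4f} s per prediction")
```

For the five lines above, the mean comes out at about 0.336 seconds per prediction.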

The script was loosely adapted from the example in the original model card and from #783.

pi314ever, Jun 20 '24 22:06

Thank you @pi314ever. I suggest adding the multiple-video example to the README.md file, along with a note on the adoption strategy.

@regisss Could you kindly provide more review/comments for this?

imangohari1, Jun 20 '24 23:06

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.