What does this PR do?

This PR adds examples and demonstrates that VideoMAE runs on Gaudi 2 in graph mode with the model cast to BF16. The included tests check compatibility with both settings, and a latency regression test covers the graph mode + BF16 configuration.

No core code changes were made to enable the model.

Before submitting

[x] Did you make sure to update the documentation with your changes?
[x] Did you write any new necessary tests?

Apr 25 '24 17:04 pi314ever

@pi314ever Daniel, please list the performance benchmarks for Gaudi 2 versus A100.

May 14 '24 00:05 yao-matrix

Performance (s)	A100	Gaudi2
BF16	0.02548	0.01313
FP32	0.05736	0.01962
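
For a quick read of the table, the relative speedup can be computed directly from the four numbers above. This is a minimal sketch: the dictionary layout and the `speedup` helper are illustrative names and are not part of the PR. With these figures, Gaudi 2 comes out roughly 1.9x faster than A100 in BF16 and roughly 2.9x faster in FP32.

```python
# Average seconds per forward pass, copied from the table above.
latency_s = {
    "BF16": {"A100": 0.02548, "Gaudi2": 0.01313},
    "FP32": {"A100": 0.05736, "Gaudi2": 0.01962},
}


def speedup(row):
    """Ratio of A100 latency to Gaudi2 latency (values above 1 favor Gaudi2)."""
    return row["A100"] / row["Gaudi2"]


for dtype, row in latency_s.items():
    print(f"{dtype}: Gaudi2 is {speedup(row):.2f}x faster than A100")
```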

Testing setup:

100 sequential forward passes of the model on a single video buffer of 16 frames
Reported performance is the average time per forward pass
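
The measurement described above can be sketched as a small timing loop. This is not the benchmark script used for the numbers in this thread: `average_latency`, `dummy_forward`, and `dummy_video` are hypothetical stand-ins, and the dummy callable only takes the place of a real VideoMAE forward pass so the sketch runs anywhere.

```python
import time
from typing import Any, Callable


def average_latency(forward: Callable[[Any], Any], video: Any, n_iters: int = 100) -> float:
    """Return the mean wall-clock seconds per call over n_iters sequential calls."""
    start = time.perf_counter()
    for _ in range(n_iters):
        forward(video)
    elapsed = time.perf_counter() - start
    return elapsed / n_iters


# Hypothetical stand-in for a 16-frame video buffer (not real pixel data).
dummy_video = [[0.0] * 8 for _ in range(16)]


def dummy_forward(frames):
    """Hypothetical stand-in for the model's forward pass."""
    return sum(sum(frame) for frame in frames)


latency = average_latency(dummy_forward, dummy_video, n_iters=100)
print(f"average time per forward pass: {latency:.3e} s")
```

On an accelerator, where work is typically queued asynchronously, a real measurement would also need to synchronize the device before reading the clock. Otherwise the loop may time only the kernel launches.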

May 14 '24 22:05 pi314ever

can you rebase this?

Jun 13 '24 21:06 mounikamandava

Looks good to me

Jun 18 '24 22:06 mounikamandava

I added the patch. I ran the script with multiple video inputs from TempoFunk/webvid-10M:

python run_example.py -bg -w 3 \
    --video_paths https://ak.picdn.net/shutterstock/videos/5629184/preview/stock-footage-senior-couple-looking-through-binoculars-on-sailboat-together-shot-on-red-epic-for-high-quality-k.mp4 \
    https://ak.picdn.net/shutterstock/videos/21179416/preview/stock-footage-aerial-shot-winter-forest.mp4 \
    https://ak.picdn.net/shutterstock/videos/1063125190/preview/stock-footage-a-beautiful-cookie-with-oranges-lies-on-a-green-tablecloth.mp4 \
    https://ak.picdn.net/shutterstock/videos/1039695998/preview/stock-footage-japanese-highrise-office-skyscrapers-tokyo-square.mp4 \
    https://ak.picdn.net/shutterstock/videos/9607838/preview/stock-footage-zrenjanin-serbia-march-fans-watching-live-concert-bokeh-blur-urban-background-x.mp4

This gave the following outputs:

Predicted class for stock-footage-senior-couple-looking-through-binoculars-on-sailboat-together-shot-on-red-epic-for-high-quality-k.mp4 is sailing and took 3.372e-01 seconds
Predicted class for stock-footage-aerial-shot-winter-forest.mp4 is sled dog racing and took 3.360e-01 seconds
Predicted class for stock-footage-a-beautiful-cookie-with-oranges-lies-on-a-green-tablecloth.mp4 is cooking sausages and took 3.349e-01 seconds
Predicted class for stock-footage-japanese-highrise-office-skyscrapers-tokyo-square.mp4 is marching and took 3.362e-01 seconds
Predicted class for stock-footage-zrenjanin-serbia-march-fans-watching-live-concert-bokeh-blur-urban-background-x.mp4 is slacklining and took 3.358e-01 seconds
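
To compare runs, the output lines above can be parsed into (video, label, seconds) tuples. This is a minimal sketch that assumes every line follows exactly the format shown. `LINE_RE` and `parse_line` are illustrative names and are not part of run_example.py. The two sample lines are copied from the output above.

```python
import re
from statistics import mean

# Matches lines of the form shown in the output above.
LINE_RE = re.compile(
    r"^Predicted class for (?P<video>\S+) is (?P<label>.+) and took (?P<seconds>\S+) seconds$"
)


def parse_line(line):
    """Return (video, label, seconds) for a matching line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match["video"], match["label"], float(match["seconds"])


sample_output = """\
Predicted class for stock-footage-aerial-shot-winter-forest.mp4 is sled dog racing and took 3.360e-01 seconds
Predicted class for stock-footage-japanese-highrise-office-skyscrapers-tokyo-square.mp4 is marching and took 3.362e-01 seconds
"""

results = [r for r in map(parse_line, sample_output.splitlines()) if r is not None]
for video, label, seconds in results:
    print(f"{video}: {label} ({seconds:.4f} s)")
print(f"mean latency: {mean(s for _, _, s in results):.4f} s")
```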

The script was loosely adapted from the example in the original model card and from #783.

Jun 20 '24 22:06 pi314ever

Thank you @pi314ever. I suggest we add the multiple-video example to the README.md file, along with a note on the adoption strategy.

@regisss Could you kindly provide more review/comments for this?

Jun 20 '24 23:06 imangohari1

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Jul 10 '24 20:07 HuggingFaceDocBuilderDev

optimum-habana

VideoMAE Model Enabling and Examples