Add ViTPose
What does this PR do?
This PR adds ViTPose as introduced in ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation.
Here's a demo notebook - note that the API might change: https://colab.research.google.com/drive/15_3gjcC0wtKSH85k76zewt81eUJIEWWA?usp=sharing.
To do:
- [x] get rid of cv2 dependency (?)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@NielsRogge Hi Niels! Does this PR currently work properly? (I want to run some tests on it.)
Thanks for working on this @NielsRogge!
I can see there are a few unfinished bits, e.g. tests. Is there a particular piece of code you'd like me to look at from a maintainer's perspective?
For the issue description, there's a related model request here: #24915
@NielsRogge I'm unsubscribing atm, so that I don't get notifications on every new push. You just need to ping me again with my username when it's ready for review and I'll get notified
It would be great to have a first round of review as the PR is in a ready state. @amyeroberts
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@amyeroberts this implementation requires preprocessing functions such as `cv2.warpAffine` and `cv2.GaussianBlur`, which do not have a non-cv2 equivalent implementation. Should we proceed with adding a `requires_cv2` dependency checker?
> which do not have a non-cv2 equivalent implementation
An affine transformation is just a linear one (plus a translation), so it should be easy for us to add a numpy equivalent. There's an example of how to implement `cv2.warpAffine` in numpy: https://theailearner.com/tag/cv2-warpaffine/
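To illustrate, a numpy-only equivalent could look like the sketch below. This is a minimal, hedged sketch (not the PR's actual code): it applies the inverse of the 2x3 affine matrix to the destination pixel grid and samples with nearest-neighbor interpolation, whereas `cv2.warpAffine` defaults to bilinear interpolation. The function name `warp_affine` is hypothetical.

```python
import numpy as np

def warp_affine(image, matrix, output_size):
    """Minimal nearest-neighbor stand-in for cv2.warpAffine.

    `matrix` is a 2x3 affine matrix mapping source -> destination;
    out-of-bounds pixels are filled with zeros.
    """
    out_h, out_w = output_size
    # Augment to 3x3 and invert, so we can map destination -> source.
    full = np.vstack([matrix, [0.0, 0.0, 1.0]])
    inv = np.linalg.inv(full)
    # Destination pixel grid in homogeneous (x, y, 1) coordinates.
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = inv @ coords
    # Nearest-neighbor sampling with bounds checking.
    src_x = np.round(src[0]).astype(int)
    src_y = np.round(src[1]).astype(int)
    valid = (
        (src_x >= 0) & (src_x < image.shape[1])
        & (src_y >= 0) & (src_y < image.shape[0])
    )
    out = np.zeros((out_h, out_w) + image.shape[2:], dtype=image.dtype)
    out_flat = out.reshape((out_h * out_w,) + image.shape[2:])
    out_flat[valid] = image[src_y[valid], src_x[valid]]
    return out
```

For the bilinear case cv2 uses by default, the sampling step would interpolate between the four surrounding source pixels instead of rounding.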
For gaussian blur, I think there's a scipy gaussian filter (`scipy.ndimage.gaussian_filter`) we could use.
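As a rough sketch of that substitution (a hedged comparison, not a verified drop-in: cv2 specifies a kernel size plus sigma, while scipy takes sigma and derives the kernel radius from `truncate`):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Roughly: cv2.GaussianBlur(img, ksize, sigmaX=s) ~ gaussian_filter(img, sigma=s).
# In scipy the effective kernel radius is about truncate * sigma rather than
# being set explicitly via ksize, so results match only approximately.
img = np.random.default_rng(0).random((8, 8)).astype(np.float32)
blurred = gaussian_filter(img, sigma=1.0)
```

Boundary handling also differs by default (scipy uses `mode="reflect"`), which may matter near crop edges in the pose-estimation preprocessing.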
Another option is adding the fast image processor equivalent for the model and using torchvision.
We should try to avoid adding another vision library dependency as much as possible.
Inference_with_ViTPose_for_human_pose_estimation_ipynb.ipynb.zip
@amyeroberts Hi Amy, here's an overall description of the major changes to make them easy to review (I skipped explaining minor changes like typos).
- https://github.com/huggingface/transformers/pull/30530/commits/1b439e20be2c07527ea89a54e4e0a749facad6be
- Made the function accept more variables
- https://github.com/huggingface/transformers/pull/30530/commits/bbd534c53a9afc335875d5a142372a3728303d7d
- Added more tests. Originally `prepare_img` used only the cat image, which is not appropriate for this model (since it is human pose estimation); changed it to a human image from COCO.
- Also made `post_process` return its result as a tensor.
IMO, we can add more test functions if needed.
cc @NielsRogge for an additional review.
@amyeroberts @NielsRogge Pinging as a reminder.
@NielsRogge Requesting a second review (I need answers to 2-3 questions).
Looking forward to this @SangbumChoi, thanks for doing this! @amyeroberts you're a great code reviewer.
@amyeroberts @NielsRogge Fortunately, all the comments from both reviews were clear enough for me to understand and address appropriately. (Also fixed ViT -> Vit.) Asking for a second, final review :)
The current CI failure is unrelated to this PR.
@amyeroberts Hi Amy, thanks again for the detailed review. I have addressed all the suggestions except for removing the `MoE` logic. We can remove it, but it is part of the original code's design. Do we really need to remove it? Otherwise it is good to go.
cc @qubvel @NielsRogge to request a PR review in Amy's place.
@SangbumChoi sorry for the delay, I will review this week!