Results: 3 comments by ohwi
I saw a small difference in the backbone: the paper uses a ViT, while this work uses a CNN.
Thank you for your reply. I think I understand the structure of your work. Thank you!!
> PS: Recent research shows that doing "Object Detection" prior to "Image Captioning" doesn't bring any additional improvement, instead it will just increase complexity. Hi. Would you let me know...