NielsRogge

Results: 388 comments by NielsRogge

Yeah, I had a hard time fine-tuning Pix2Struct myself. However, looking at your code snippet, when you encode the target sequence:

```python
from transformers import Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-base")
dummy_target...
```
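For reference, a minimal sketch of how a target sequence could be encoded for Pix2Struct fine-tuning (this is not the original snippet; the target string is a placeholder and the tokenizer-based encoding is an assumption):

```python
from transformers import Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-base")

# Placeholder target text; in practice this is the ground-truth sequence
# for your task (e.g. a caption or a structured output string).
dummy_target = "a photo of two cats on a couch"

# Encode the target with the processor's tokenizer so it can be passed as `labels`.
labels = processor.tokenizer(
    text=dummy_target,
    return_tensors="pt",
    add_special_tokens=True,  # appends the EOS token at the end of the sequence
).input_ids

print(labels)
```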

Indeed, the loss should go down to 0. I notice two things here:

* I see label smoothing is used, which is pretty uncommon: https://github.com/huggingface/transformers/blob/7579a52b55611ba7651b6d05cba6f45539a6089d/src/transformers/models/pix2struct/modeling_pix2struct.py#L1557 According to PyTorch's [docs](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html): "The...
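To illustrate the label smoothing point (this snippet is not from the linked modeling file), a smoothed cross-entropy loss stays above 0 even when the model is almost perfectly confident in the correct class:

```python
import torch
import torch.nn.functional as F

# Near-perfect prediction for class 0.
logits = torch.tensor([[10.0, -10.0, -10.0]])
target = torch.tensor([0])

# Without label smoothing the loss is essentially 0.
print(F.cross_entropy(logits, target).item())

# With label smoothing, part of the target mass goes to the wrong classes,
# so the loss has a strictly positive floor and can never reach 0.
print(F.cross_entropy(logits, target, label_smoothing=0.1).item())
```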

Damn, not sure why I didn't check the code of the loss calculation before training a model myself 🙈 Hopefully this will also fix the fine-tuning runs on larger datasets.

Cool model! I've contributed X-CLIP in the past: https://huggingface.co/docs/transformers/model_doc/xclip, which is an extension of CLIP for video-language pretraining. CLIP-ViP seems to focus more on retrieval. Looks like a great candidate...

Yes, ideally GPT-Neo would also have integration tests that check exact logit values. However, you can see [here](https://github.com/huggingface/transformers/blob/88399476c3892435395618ed37993176dbb0de73/tests/models/gpt_neo/test_modeling_gpt_neo.py#L519) that expected output IDs and generated texts are tested. But in any case...
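Such a test typically looks roughly like the sketch below (this is not the actual test from the transformers repo; the checkpoint name and expected values are placeholders):

```python
import torch
from transformers import GPT2Tokenizer, GPTNeoForCausalLM


def test_gpt_neo_generation():
    model_id = "EleutherAI/gpt-neo-125M"  # assumed checkpoint for illustration
    tokenizer = GPT2Tokenizer.from_pretrained(model_id)
    model = GPTNeoForCausalLM.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer("My name is", return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)

    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # In a real integration test these are hardcoded reference values obtained
    # from a trusted run of the original model.
    expected_ids = [...]   # placeholder
    expected_text = "..."  # placeholder

    assert output_ids[0].tolist() == expected_ids
    assert generated_text == expected_text
```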

In that case, you can copy the classes, call them `CLIPVipConfig`, `CLIPVipTextConfig`, etc. and add `Copied from` on top of them. If you then run `make fix-copies` from the root...
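Roughly, the pattern looks like this (a minimal sketch assuming a hypothetical `CLIPVipTextConfig`; the exact replacement patterns would depend on the final naming):

```python
from transformers import PretrainedConfig


# The special comment below is picked up by `make fix-copies`, which keeps the
# class body in sync with the CLIP original while applying the renamings.
# Copied from transformers.models.clip.configuration_clip.CLIPTextConfig with CLIP->CLIPVip, clip->clip_vip
class CLIPVipTextConfig(PretrainedConfig):
    model_type = "clip_vip_text_model"
    # ... body kept identical to CLIPTextConfig by `make fix-copies`
```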

Hi @jegork, congrats on your amazing contribution! Is it ok if we transfer the ViViT checkpoints to the `google` organization on the hub? (assuming they are officially released checkpoints by...

@Narsil what's your opinion on my comment above? For example, `Pix2StructForConditionalGeneration` solves both image captioning and VQA with the same model, using the same approach. You can, in addition to the...

> Is there any difference in the code between vqa and captioning ?

For models like BLIP-2 and Pix2Struct, the code is identical. For BLIP on the other hand, 2...
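As a rough sketch of what "identical code" means for Pix2Struct (checkpoint names, image URL, and question are illustrative placeholders), the only difference between captioning-style and VQA-style usage is whether a question is passed to the processor:

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

image = Image.open(requests.get("https://example.com/document.png", stream=True).raw)

# Captioning-style usage: only the image is passed to the processor.
cap_processor = Pix2StructProcessor.from_pretrained("google/pix2struct-textcaps-base")
cap_model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-textcaps-base")
inputs = cap_processor(images=image, return_tensors="pt")
ids = cap_model.generate(**inputs, max_new_tokens=50)
print(cap_processor.decode(ids[0], skip_special_tokens=True))

# VQA-style usage: same model class, same generate call; the question is
# simply given to the processor (which renders it as a header on the image).
vqa_processor = Pix2StructProcessor.from_pretrained("google/pix2struct-docvqa-base")
vqa_model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-docvqa-base")
inputs = vqa_processor(images=image, text="What is the title of this document?", return_tensors="pt")
ids = vqa_model.generate(**inputs, max_new_tokens=50)
print(vqa_processor.decode(ids[0], skip_special_tokens=True))
```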

Yes, we can technically add them to both pipelines, if you are fine with that.