
Confusion about dataset setup and the definition of the labels

Open XuMengyaAmy opened this issue 4 years ago • 4 comments

Dear Author, thanks for your work! I have some questions about the code:

(1) Dataset splitting. Assume we have a custom dataset called Seg_Caption that has both segmentation mask annotations and caption annotations. Can we split it as follows: Seg_Caption_train contains only the input images and captions, while Seg_Caption_val contains the input images, captions, and segmentation masks. The masks are used to compare the model's segmentation predictions against the ground-truth masks, and the seg task is enabled only during evaluation:

```yaml
train:
  - Seg_Caption_train
val:
  - Seg_Caption_val
evaluate:
  task:
    - seg
```

(2) Definition of the labels in the code:

```python
labels = torch.arange(batch_size, dtype=torch.long, device=image_x.device) + batch_size * dist.get_rank()
loss_img = self.cross_entropy(logits_per_img * logit_scale, labels)
loss_text = self.cross_entropy(logits_per_text * logit_scale, labels)
```

Does this definition of the labels make sense? It is hard to understand what this loss is meant to do.

(3) WebDataset format. Following this video tutorial: https://www.youtube.com/watch?v=v_PacO-3OGQ, I tried using `tar --sort=name -cf ../dataset.tar .` to convert my dataset into the required format, but training stops after the following output (note that `grad_norm` is already `nan`):

```
(main_group_vit.py 284): INFO Train: [0/30][0/750] eta 0:15:46 lr 0.000000 time 1.2622 (1.2622) total_loss 2.3884 (2.3884) loss 0.7358 (0.7358) multi_label_loss 1.6526 (1.6526) grad_norm nan (nan) mem 4184MB
```

Do you know why this happens?

Thanks for your help! Mengya Xu

XuMengyaAmy avatar Apr 21 '22 14:04 XuMengyaAmy

Hi @XuMengyaAmy

Thanks for your interest in our work.

(1) The training set should be in webdataset format with image-text pairs. The val set is only used for zero-shot image classification evaluation, so it should contain [image, cls] pairs, also in webdataset format. If you would like to evaluate on a custom segmentation dataset, please follow the MMSegmentation dataset preparation tutorial.

(2) It follows the CLIP-style contrastive loss here. You may read that paper for more details.
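For readers puzzled by the `arange` labels, here is a minimal single-GPU sketch of the CLIP-style symmetric loss (no `dist.get_rank()` offset; the feature tensors are random stand-ins and `logit_scale` is fixed rather than learned, so this is an illustration, not the GroupViT implementation). The label for sample `i` is simply `i`: the matching text for image `i` sits at position `i` of the batch, i.e. on the diagonal of the logit matrix, and every other text in the batch is treated as a negative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, dim = 4, 8
# Random stand-ins for the image and text encoder outputs, L2-normalized
# so that the dot products below are cosine similarities.
image_x = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_x = F.normalize(torch.randn(batch_size, dim), dim=-1)
logit_scale = 100.0  # temperature; learned in CLIP, fixed here

# Row i of logits_per_img holds the similarity of image i to every text.
logits_per_img = image_x @ text_x.t()
logits_per_text = logits_per_img.t()

# Target for sample i is i: its true pair lies on the diagonal.
labels = torch.arange(batch_size, dtype=torch.long)

cross_entropy = nn.CrossEntropyLoss()
loss_img = cross_entropy(logits_per_img * logit_scale, labels)
loss_text = cross_entropy(logits_per_text * logit_scale, labels)
loss = (loss_img + loss_text) / 2
```

In the multi-GPU code, `+ batch_size * dist.get_rank()` simply shifts the diagonal so that each rank's targets index its own samples inside the globally gathered logit matrix.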

(3) Please follow the data conversion section to prepare your dataset, for example, gcc3m.
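As a rough sketch of the layout webdataset expects (the filenames below are made up): each sample's files share a basename and differ only by extension, and `--sort=name` (a GNU tar option) keeps a pair's files adjacent inside the archive, which is what webdataset relies on when grouping samples.

```shell
# Build a tiny demo dataset: each image/caption pair shares a basename.
mkdir -p wds_demo && cd wds_demo
printf 'a caption' > 000000.txt
head -c 16 /dev/zero > 000000.jpg   # placeholder bytes standing in for an image
printf 'another caption' > 000001.txt
head -c 16 /dev/zero > 000001.jpg
# --sort=name packs 000000.jpg next to 000000.txt, and so on.
tar --sort=name -cf ../dataset.tar .
cd .. && tar -tf dataset.tar
```

If the tar is built without sorted names, webdataset may fail to pair files into samples, which is one thing worth checking when training misbehaves right after the first step.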

xvjiarui avatar Apr 23 '22 01:04 xvjiarui

Thank you for the answer. I have read the CLIP paper, but could not find anything that addresses the following question. Regarding point (2), is this definition of the loss based on the assumption that the labels of the samples within each batch are all different? Would this loss still be valid if I had a batch size of 128 with only two classes for training?

aryaabdi avatar Jun 14 '23 18:06 aryaabdi

> Thank you for the answer. I have read the CLIP paper, but could not find any information that would address the following question. Regarding the point (2), is this definition of loss based on the assumption that the labels corresponding to samples within each batch are different? Would this loss still be valid if I have a batch size of 128 with only two classes for training?

Have you resolved question (2)? I have the same confusion.

YLiu-creator avatar Nov 23 '23 02:11 YLiu-creator

My understanding is that the CLIP-based loss is effective when the labels (entities) within a training batch are all different, and that seems to be the case for the training datasets used in GroupViT; for example, gcc3m has about 16K entities (classes, or labels). I had to find this out the hard way by running many experiments with a limited number of entities. Let me know if you have a different understanding/observation.
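A small sketch of the failure mode described above (toy hand-picked embeddings, not GroupViT code): if two images in a batch share the same caption, the `arange` targets treat one true match as a negative, so even perfectly aligned features cannot drive the loss to zero.

```python
import torch
import torch.nn.functional as F

# Images 0 and 1 share a caption, so their text embeddings are identical;
# image 2 has a distinct caption.
text_x = torch.tensor([[1.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0]])
image_x = text_x.clone()  # perfectly aligned image-text features

logits_per_img = image_x @ text_x.t()
labels = torch.arange(3)  # arange targets assume all pairs are distinct
loss = F.cross_entropy(logits_per_img * 100.0, labels)

# Image 0 is equally similar to texts 0 and 1, but only text 0 counts as
# "correct", so its term stays near -log(0.5) ~ 0.693 and the total loss
# never reaches zero, no matter how good the features are.
```

With a batch size of 128 and only two classes, almost every row of the logit matrix would contain many such duplicated "negatives", so the loss would be systematically miscalibrated rather than merely noisy.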

aryaabdi avatar Feb 28 '24 14:02 aryaabdi