Data issues

Open zcczhang opened this issue 1 year ago • 9 comments

Hi, thanks for the amazing work and for releasing MIMIC-IT! There seem to be a few issues:

  • For LLaVA-In-Context, the meta link here seems to be missing; I assume it is supposed to point to the LAxxx_train.json files? I may be misunderstanding, but it seems that the code here does not strip the LAxx_INS prefix (e.g. via cur_image_id.split('_')[-1] for LACONV, LACR_I2I, etc.), so the LAxx_INS_ prefix is unexpectedly included when reading COCO images. There are also some cases with keys like coco/train2017/000000033471_2.jpg, for which no _2 image can be found.
  • For TV caption, the image ids in TVC_instructions.json do not seem to correspond to the ids in the converted TVC.json. There are repeated patterns, e.g. TVC_IMG_castle_s07e09_seg02_clip_02_castle_s07e09_seg02_clip_02_00009 or TVC_IMG_s04e13_seg01_clip_00_bbt_s04e13_seg01, so the keys need to be rekeyed with r'(TVC_IMG)_(.+?_clip_[0-9]+)_(.+?_clip_[0-9]+)_([0-9]+)' in both cases.
  • For spot the difference, the [:5] here is probably unintended; otherwise only 5 examples are used?
  • There is a typo here; it should probably be video.VisualStoryTelling.
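To make the two id fixes above concrete, here is a minimal sketch. It is not the repository's code: the helper names `rekey_tvc` and `strip_la_prefix` are hypothetical, the sample LACONV id in the usage note is illustrative, and the choice to keep the second clip name when collapsing a doubled TVC key is an assumption about what the converted TVC.json expects.

```python
import re

# Pattern quoted in the issue: prefix, first clip name, second clip name, frame index.
TVC_KEY = re.compile(r"(TVC_IMG)_(.+?_clip_[0-9]+)_(.+?_clip_[0-9]+)_([0-9]+)")


def rekey_tvc(image_id: str) -> str:
    """Collapse a doubled clip name; return the id unchanged if it does not match."""
    m = TVC_KEY.fullmatch(image_id)
    if m is None:
        return image_id
    prefix, _first_clip, second_clip, frame = m.groups()
    # Assumption: the converted TVC.json keys use the second clip name.
    return f"{prefix}_{second_clip}_{frame}"


def strip_la_prefix(image_id: str) -> str:
    """Keep only the trailing id, mirroring cur_image_id.split('_')[-1]."""
    return image_id.split("_")[-1]
```

For example, `rekey_tvc` applied to the first id quoted above gives `TVC_IMG_castle_s07e09_seg02_clip_02_00009`, and `strip_la_prefix("LACONV_INS_000000033471")` gives `000000033471`. Note that the plain split would reduce a key ending in `_2` to just `2`, so those cases would still need separate handling.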

For other datasets, it would be great to release the processed x.json files (I noticed the egg version is coming soon), as some datasets are too old to acquire or process and some video datasets are large. Thank you!

zcczhang avatar Jun 26 '23 20:06 zcczhang

Thanks for bringing up these issues.

This seems related to the convert-it process, right? The current convert-it cannot generate image_ids that match the ids in xx_instructions.json and xx_train.json.

We first converted our xx.json for internal use, and only afterwards wrote "convert-it" to help users obtain xx.json from public datasets. It seems there may be issues with the IDs in this conversion process. We are currently investigating and appreciate your patience while we address the problem.
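In the meantime, a mismatch of this kind can be detected with a quick check like the sketch below. It assumes that the instructions file has a top-level "data" mapping whose entries each carry an "image_ids" list, and that xx.json maps image ids to encoded images; the function name `missing_image_ids` and the file names in the usage note are hypothetical.

```python
import json


def missing_image_ids(instructions: dict, images: dict) -> set:
    """Return image ids referenced by the instructions but absent from the image file."""
    referenced = set()
    # Assumption: instructions["data"] maps instruction ids to entries with "image_ids".
    for entry in instructions.get("data", {}).values():
        referenced.update(entry.get("image_ids", []))
    # Assumption: the image file is a flat mapping keyed by image id.
    return referenced - set(images)


def load_json(path: str) -> dict:
    """Read a JSON file into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Usage would look like `missing_image_ids(load_json("xx_instructions.json"), load_json("xx.json"))`; an empty set means every referenced id has a matching image.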

updates:

  1. The meta link for LLaVA-In-Context is being updated: meta

Luodian avatar Jun 26 '23 23:06 Luodian

Saw the PR above; just wondering whether the COCO general difference train and instruction JSON files are available. Thanks!

zcczhang avatar Jun 29 '23 18:06 zcczhang

Hi @Luodian, just wondering when the SD (COCO general difference version) instructions and train configs will be ready in the OneDrive folder?

zcczhang avatar Jul 09 '23 22:07 zcczhang

Hi, sorry, I didn't see the message last week. The files are already on our side. We are waiting for @king159 to do a final check and then expect to release them today.

Luodian avatar Jul 09 '23 23:07 Luodian

Thanks for the quick response!

zcczhang avatar Jul 09 '23 23:07 zcczhang

@pufanyi @king159

Luodian avatar Jul 10 '23 15:07 Luodian

Please let me know when it's ready for download (and maybe also the E4D egg)!

zcczhang avatar Jul 11 '23 17:07 zcczhang

Hi, the COCO Difference instructions/train JSON files have been uploaded, and the raw image JSON is uploading now~

Luodian avatar Jul 12 '23 23:07 Luodian

That sounds great, thanks! I think I have already processed the image JSON file. Btw, will the egg for E4D be available, or is it too large to upload? (Another minor btw: I'm not very familiar with OneDrive; is there a good way to download directly from the link to a headless server?)

zcczhang avatar Jul 13 '23 00:07 zcczhang