Yuxuan Wang
How to reproduce the behaviour --------- First, thanks for developing such a wonderful tool. Is there any way to import bundles of images automatically? It's hard to...
Hi, what does "episode" stand for in your dataset paper? I can't find any explanation of it. Thank you very much.
Thank you for your exceptional efforts. Even so, after several days of work I am still stuck at the initial stage. I have meticulously reviewed each dataset, yet...
Hi, the extracted features can be found here: https://drive.google.com/drive/folders/14zlHmNFkCgptiGttwWKrsaaz5vVUFs00?usp=sharing _Originally posted by @hudaAlamri in https://github.com/batra-mlp-lab/avsd/issues/2#issuecomment-561653010_
To my knowledge, the videos in the NExT-QA dataset are relatively short, with an average length of 44 seconds, and there is a noted static bias [1] in the ActivityNet-QA...
Could you please provide a script or JSON file of the ID map from M3IT to VideoChat2IT? Matching different files can be quite challenging. For example, `coco llava minigpt4 paragraph_captioning...
For tokenizers in `transformers`, by convention, `tokenizer.vocab_size` [as documented](https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/tokenization_utils.py#L378-L383) is the size of the base vocabulary (without the added tokens). To get the actual vocabulary size, you need to use...
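To illustrate the distinction, a minimal sketch (the model name here is just an illustrative choice):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size)  # base vocabulary only, e.g. 30522

# Registering extra tokens does NOT change vocab_size...
tokenizer.add_tokens(["<my_new_token>"])
print(tokenizer.vocab_size)  # still 30522

# ...but len(tokenizer) counts base vocab + added tokens.
print(len(tokenizer))        # 30523
```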
Thank you for your excellent work; it is impressively fast. However, when I test it with a short 16 kHz speech sample, the decoded voice sounds unclear. Is this a normal...
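In case it is useful to others hitting the same symptom, one possible cause (an assumption on my part, not confirmed by the authors) is a sample-rate mismatch, e.g. a codec trained at 24 kHz being fed 16 kHz audio directly. A minimal resampling sketch with `librosa` (the file names and the 24 kHz target rate are hypothetical):

```python
import librosa
import soundfile as sf

# Load at the native rate (sr=None avoids librosa's default 22050 Hz resample).
wav, sr = librosa.load("sample_16k.wav", sr=None)

# Resample to the rate the codec was (hypothetically) trained at.
wav_24k = librosa.resample(wav, orig_sr=sr, target_sr=24000)
sf.write("sample_24k.wav", wav_24k, 24000)
```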
The StreamingBench results in your report (https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9) appear to be based on a 60-second setup. I would like to request the official results for a long-context setup, as well as...
Hi all, We’ve implemented the training code and added vision input support. You can now convert LVLM/LALM/LLM models to OmniLLM. Check out OpenOmniNexus here: https://github.com/OmniMMI/OpenOmniNexus