[Feature] Add support for `Pixmo` vision-language datasets
Feature request
https://huggingface.co/collections/allenai/pixmo-674746ea613028006285687b
PixMo is a set of vision-language datasets built by Ai2 and used to train the Molmo family of models.
Motivation / references
Let's add some of the PixMo datasets to Oumi under https://github.com/oumi-ai/oumi/tree/main/src/oumi/datasets/vision_language
For example:
- https://huggingface.co/datasets/allenai/pixmo-docs
- https://huggingface.co/datasets/allenai/pixmo-cap
Documentation: https://oumi.ai/docs/en/latest/resources/datasets/vl_sft_datasets.html
Your contribution
If somebody volunteers to start this work, I can answer questions and help with testing.
I can help with this
Great, thank you! Assigned to you. If you have any questions, please ask on Discord: https://discord.gg/oumi
I was just chatting about this issue on discord. It looks like several of the images in the Pixmo dataset 404...
INFO oumi:models.py:455 Using the chat template 'qwen2-vl-instruct' specified in model config for model 'Qwen/Qwen2-VL-2B-Instruct'.
INFO oumi:base_map_dataset.py:82 Creating map dataset (type: PixmoAskModelAnythingDataset) dataset_name: 'allenai/pixmo-ask-model-anything', dataset_path: 'None'...
INFO oumi:base_map_dataset.py:486 Dataset Info:
Split: train
Version: 0.0.0
Dataset size: 108249896
Download size: 63295710
Size: 171545606 bytes
Rows: 161737
Columns: ['image_url', 'image_sha256', 'question', 'answer']
INFO oumi:base_map_dataset.py:425 Loaded DataFrame with shape: (161737, 4). Columns:
image_url object
image_sha256 object
question object
answer object
dtype: object
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(1104, 1176), dtype='float32', id=None) vs Array2D(shape=(3248, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(3248, 1176), dtype='float32', id=None) vs Array2D(shape=(4888, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(4888, 1176), dtype='float32', id=None) vs Array2D(shape=(2400, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(2400, 1176), dtype='float32', id=None) vs Array2D(shape=(4900, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(4900, 1176), dtype='float32', id=None) vs Array2D(shape=(1564, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(1564, 1176), dtype='float32', id=None) vs Array2D(shape=(4988, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(4988, 1176), dtype='float32', id=None) vs Array2D(shape=(1560, 1176), dtype='float32', id=None)!
INFO oumi:base_map_dataset.py:311 PixmoAskModelAnythingDataset: features=dict_keys(['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw', 'labels'])
ERROR oumi:image_utils.py:140 Failed to download image: 'https://img-aws.ehowcdn.com/700x/www.onlyinyourstate.com/wp-content/uploads/2020/09/22587445354_340a0e0a9f_c.jpg'
Traceback (most recent call last):
File "/Users/joe/Dev/oumi/src/oumi/utils/image_utils.py", line 138, in load_pil_image_from_url
response.raise_for_status()
File "/Users/joe/miniconda3/envs/oumi/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://img-aws.ehowcdn.com/700x/www.onlyinyourstate.com/wp-content/uploads/2020/09/22587445354_340a0e0a9f_c.jpg
@xrdaukar What are your thoughts on adding support for ignoring 404s in dataset parsing?
We could add an option along the lines of `ignore_404s: bool = False` to the constructor of `VisionLanguageSftDataset`, but we'd have to find a suitable way to handle these 404s in the methods that return examples.
The current Oumi dataset design assumes all examples are valid. I think it would take non-trivial effort and/or a redesign to allow skipping "corrupt" examples (and, to be honest, I'm not sure that allowing "corrupt" examples would be the right thing to do).
Some ideas for short-term workarounds:
- Let's reach out to Pixmo dataset owners about the missing files (if the 404 issue is non-transient). For example, post the error details under: https://huggingface.co/datasets/allenai/pixmo-docs/discussions
- Make a copy of the dataset with the bad records removed until the Pixmo dataset issue is resolved. The `dataset_path` param can potentially be used to point to an alternative dataset location.
Also, it may be possible to use a fraction of the dataset to test your code changes, e.g., split='train[10:20]' (https://huggingface.co/docs/datasets/v1.11.0/splits.html). This way you may be able to exclude the corrupt records for testing purposes until the dataset issue is resolved; see the sketch below.
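For illustration, here is a minimal, hypothetical sketch of both workarounds: loading a small slice of a split via the Hugging Face `datasets` split-slicing syntax, and pre-filtering rows whose `image_url` no longer resolves. The dataset name and column name come from the log output above; the helper function and output path are made up for the example, not part of Oumi.

```python
# Hypothetical pre-filtering sketch (not part of Oumi): load a small slice of a
# PixMo split and drop rows whose image URL no longer resolves (e.g., 404s).
import requests
from datasets import load_dataset

# Load only a fraction of the split for quick testing (HF split-slicing syntax).
ds = load_dataset("allenai/pixmo-ask-model-anything", split="train[:20]")


def image_url_is_alive(example: dict) -> bool:
    """Return True if the example's image URL still resolves."""
    try:
        response = requests.head(
            example["image_url"], timeout=10, allow_redirects=True
        )
        return response.status_code == 200
    except requests.RequestException:
        return False


# Keep only rows with reachable images; save the cleaned copy locally and point
# `dataset_path` at it (or push it to your own HF repo) until the upstream issue is fixed.
clean_ds = ds.filter(image_url_is_alive)
clean_ds.save_to_disk("pixmo_ask_model_anything_clean")
```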
- In the pixmo-cap-qa dataset, the messages column has the same info as the question and answer columns but in a different format. Should only the messages column be included to eliminate redundancy, or all 3 columns?
- The pixmo-count dataset has integers, but there is no integer type for `ContentItem`. Should they be converted into strings?
- In pixmo-count, there are X,Y coordinate points. How should these be encoded? Although these integer tuples could be converted to strings, that would lose information in the process.
- The issue says "some" pixmo datasets. There are 8 total. How many should be complete before submitting a pull request?
1. Use only one of them (whichever format is easier, or more widely used in other subsets). SFT datasets should return Oumi `Conversation` objects, so you can ignore the redundant data.
2 and 3: The pixmo-count dataset is a bit unusual as it contains structured data. I think one way to handle this kind of data is to generate a prompt asking the model to count the objects in the picture and return their centroids as JSON in this format: <JSON example>...
You're an object detector. Your goal is to detect all objects of the same type in the picture
and return the number of objects, their type, and image coordinates (2D centroids). Return the centroids as integers in the [0, 1000] range for both X and Y coordinates.
Please output the result in JSON format with keys 'object_type', 'num_objects', 'object_2d_centroids'.
You should only output the JSON, nothing else. Like this:
```json
<SAMPLE JSON>
```
(Feel free to refine/improve the prompt)
To help with JSON response formatting, you can define a `pydantic.BaseModel` class with the respective fields ('object_type', 'num_objects', 'object_2d_centroids'), populate it with some test data, and convert it to JSON (see the sketch after the links below). If you do so, you can include this response pydantic class along with your dataset class (either as a docstring or as an unused class definition) for self-documentation. Users can also potentially use the pydantic class for constrained/guided decoding:
- https://github.com/oumi-ai/oumi/blob/bfeff245afdd1eddf8f71d569c286f8ed9d1601d/src/oumi/core/configs/params/guided_decoding_params.py#L30 )
- https://github.com/oumi-ai/oumi/blob/bfeff245afdd1eddf8f71d569c286f8ed9d1601d/tests/unit/inference/test_remote_inference_engine.py#L1221
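As a minimal sketch of such a response model (assuming pydantic v2; the class name `CountResponse` and sample values are made up, while the field names follow the suggestion above):

```python
# Hypothetical response schema for pixmo-count examples (illustrative only).
from pydantic import BaseModel


class CountResponse(BaseModel):
    """Structured answer: object type, count, and 2D centroids in [0, 1000]."""

    object_type: str
    num_objects: int
    object_2d_centroids: list[tuple[int, int]]


# Populate with test data and convert to JSON, e.g., to produce <SAMPLE JSON>
# for the prompt or to document the expected output format.
sample = CountResponse(
    object_type="boat",
    num_objects=3,
    object_2d_centroids=[(120, 455), (398, 210), (760, 512)],
)
print(sample.model_dump_json(indent=2))  # pydantic v2 API
```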
4. It's largely up to you. You can start with the easier ones. Feel free to split the work into multiple PRs if needed.
How should I approach making a config file for these datasets? Should I use the MolmoE-1B model as they did in the paper? What parameters should I use?
Model training configs should be separate from the dataset definition, i.e., there is no need to submit a training config that uses the new dataset (we can't possibly have sample configs for all model/dataset permutations). It's enough to test the new dataset offline. You can use any model you want, but it's probably easiest to use one of the pre-tested VLMs defined under https://github.com/oumi-ai/oumi/tree/main/configs/recipes/vision
(Molmo is not supported yet: Molmo integration into transformers is a work in progress, see https://github.com/oumi-ai/oumi/issues/1400)
I've tried to run a few vision-language models on the Pixmo datasets on an A100, using the configs in recipes and just changing the dataset name, but I get the error:
raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.") ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None, filters=None) doesn't have a 'processor_name' key.
The unit and integration tests work fine. The changes can be seen here.
Any suggestions on how to fix this please?
We usually see this error when your custom dataset isn't properly registered in the Oumi REGISTRY. Please check that you added your new dataset to the module initialization script: https://github.com/oumi-ai/oumi/blob/main/src/oumi/datasets/vision_language/__init__.py (see the sketch below).
If that doesn't help, any chance you can share your PR so we can take a look?
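For illustration, the registration typically looks roughly like the sketch below. The class and dataset names are taken from the log output above; the exact decorator, base class, attribute, and method names are assumptions based on the pattern used by the existing vision-language datasets in that module, so treat the details as illustrative rather than definitive.

```python
# src/oumi/datasets/vision_language/pixmo_ask_model_anything.py (illustrative sketch)
from oumi.core.datasets import VisionLanguageSftDataset
from oumi.core.registry import register_dataset


@register_dataset("allenai/pixmo-ask-model-anything")  # name used in training configs
class PixmoAskModelAnythingDataset(VisionLanguageSftDataset):
    default_dataset = "allenai/pixmo-ask-model-anything"

    def transform_conversation(self, example):
        # Convert a raw row (image_url, question, answer) into an Oumi Conversation.
        ...


# src/oumi/datasets/vision_language/__init__.py (illustrative sketch)
# The new class must be imported and exported here so the decorator runs and the
# dataset lands in the registry:
from oumi.datasets.vision_language.pixmo_ask_model_anything import (
    PixmoAskModelAnythingDataset,
)

__all__ = [
    # ... existing dataset classes ...
    "PixmoAskModelAnythingDataset",
]
```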
OK, thanks! I was able to fix the error thanks to your hint. I submitted the PR.
I just merged the PR. Thanks for your contribution @jrwana !