[Feature] Add support for `Pixmo` vision-language datasets
Feature request
https://huggingface.co/collections/allenai/pixmo-674746ea613028006285687b
PixMo is a set of vision-language datasets built by Ai2 and used to train the Molmo family of models.
Motivation / references
Let's add some of the PixMo datasets to Oumi under https://github.com/oumi-ai/oumi/tree/main/src/oumi/datasets/vision_language
For example:
- https://huggingface.co/datasets/allenai/pixmo-docs
- https://huggingface.co/datasets/allenai/pixmo-cap
Documentation: https://oumi.ai/docs/en/latest/resources/datasets/vl_sft_datasets.html
Your contribution
If somebody volunteers to start this work, I can answer questions and help with testing.
I can help with this
Great, thank you! Assigned to you. If you have any questions, please ask on Discord: https://discord.gg/oumi
I was just chatting about this issue on discord. It looks like several of the images in the Pixmo dataset 404...
INFO oumi:models.py:455 Using the chat template 'qwen2-vl-instruct' specified in model config for model 'Qwen/Qwen2-VL-2B-Instruct'.
INFO oumi:base_map_dataset.py:82 Creating map dataset (type: PixmoAskModelAnythingDataset) dataset_name: 'allenai/pixmo-ask-model-anything', dataset_path: 'None'...
INFO oumi:base_map_dataset.py:486 Dataset Info:
Split: train
Version: 0.0.0
Dataset size: 108249896
Download size: 63295710
Size: 171545606 bytes
Rows: 161737
Columns: ['image_url', 'image_sha256', 'question', 'answer']
INFO oumi:base_map_dataset.py:425 Loaded DataFrame with shape: (161737, 4). Columns:
image_url object
image_sha256 object
question object
answer object
dtype: object
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(1104, 1176), dtype='float32', id=None) vs Array2D(shape=(3248, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(3248, 1176), dtype='float32', id=None) vs Array2D(shape=(4888, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(4888, 1176), dtype='float32', id=None) vs Array2D(shape=(2400, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(2400, 1176), dtype='float32', id=None) vs Array2D(shape=(4900, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(4900, 1176), dtype='float32', id=None) vs Array2D(shape=(1564, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(1564, 1176), dtype='float32', id=None) vs Array2D(shape=(4988, 1176), dtype='float32', id=None)!
WARNING oumi:base_map_dataset.py:224 The pixel_values feature has variable shapes: Array2D(shape=(4988, 1176), dtype='float32', id=None) vs Array2D(shape=(1560, 1176), dtype='float32', id=None)!
INFO oumi:base_map_dataset.py:311 PixmoAskModelAnythingDataset: features=dict_keys(['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw', 'labels'])
ERROR oumi:image_utils.py:140 Failed to download image: 'https://img-aws.ehowcdn.com/700x/www.onlyinyourstate.com/wp-content/uploads/2020/09/22587445354_340a0e0a9f_c.jpg'
Traceback (most recent call last):
File "/Users/joe/Dev/oumi/src/oumi/utils/image_utils.py", line 138, in load_pil_image_from_url
response.raise_for_status()
File "/Users/joe/miniconda3/envs/oumi/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://img-aws.ehowcdn.com/700x/www.onlyinyourstate.com/wp-content/uploads/2020/09/22587445354_340a0e0a9f_c.jpg
@xrdaukar What are your thoughts on adding support for ignoring 404s in dataset parsing?
We could add an option along the lines of `ignore_404s: bool = False` to the constructor of `VisionLanguageSftDataset`, but we'd have to find a suitable way to handle these 404s in the methods that return examples.
The current Oumi dataset design assumes all examples are valid. I think it would take non-trivial effort and/or a redesign to allow skipping "corrupt" examples (and, to be honest, I'm not sure that allowing "corrupt" examples would be the right thing to do).
Some ideas for short-term workarounds:
- Let's reach out to Pixmo dataset owners about the missing files (if the 404 issue is non-transient). For example, post the error details under: https://huggingface.co/datasets/allenai/pixmo-docs/discussions
- Make a copy of the dataset with the bad records removed until the Pixmo dataset issue is resolved. The `dataset_path` param can potentially be used to point to an alternative dataset location.
Also, it may be possible to use a fraction of the dataset to test your code changes, e.g., split='train[10:20]' (https://huggingface.co/docs/datasets/v1.11.0/splits.html). This way you may be able to exclude the corrupt records for testing purposes until the dataset issue is resolved; see the sketch below.
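For illustration, here is a minimal, hypothetical sketch of both workarounds: loading a small slice of a split via the Hugging Face `datasets` split-slicing syntax, and pre-filtering rows whose `image_url` no longer resolves. The dataset name and column name come from the log output above; the helper function and output path are made up for the example, not part of Oumi.

```python
# Hypothetical pre-filtering sketch (not part of Oumi): load a small slice of a
# PixMo split and drop rows whose image URL no longer resolves (e.g., 404s).
import requests
from datasets import load_dataset

# Load only a fraction of the split for quick testing (HF split-slicing syntax).
ds = load_dataset("allenai/pixmo-ask-model-anything", split="train[:20]")


def image_url_is_alive(example: dict) -> bool:
    """Return True if the example's image URL still resolves."""
    try:
        response = requests.head(
            example["image_url"], timeout=10, allow_redirects=True
        )
        return response.status_code == 200
    except requests.RequestException:
        return False


# Keep only rows with reachable images; save the cleaned copy locally and point
# `dataset_path` at it (or push it to your own HF repo) until the upstream issue is fixed.
clean_ds = ds.filter(image_url_is_alive)
clean_ds.save_to_disk("pixmo_ask_model_anything_clean")
```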
- In the pixmo-cap-qa dataset, the messages column has the same info as the question and answer columns but in a different format. Should only the messages column be included to eliminate redundancy, or all 3 columns?
- The pixmo-count dataset has integers, but there is no integer type for `ContentItem`. Should they be converted into strings?
- In pixmo-count, there are X,Y coordinate points. How should these be encoded? Although these integer tuples could be converted to strings, that would lose information in the process.
- The issue says "some" pixmo datasets. There are 8 total. How many should be complete before submitting a pull request?
1. Use only one of them (whichever format is easier, or more widely used in other subsets). SFT datasets should return Oumi `Conversation` objects, so you can ignore the redundant data.
2 and 3: The pixmo-count dataset is a bit unusual as it contains structured data. I think one way to handle this kind of data is to generate a prompt asking the model to count the objects in the picture and return their centroids as JSON in this format: <JSON example>...
You're an object detector. Your goal is to detect all objects of the same type in the picture
and return the number of objects, their type, and image coordinates (2D centroids). Return the centroids as integers in the [0, 1000] range for both X and Y coordinates.
Please output the result in JSON format with keys 'object_type', 'num_objects', 'object_2d_centroids'.
You should only output the JSON, nothing else. Like this:
```json
<SAMPLE JSON>
```
(Feel free to refine/improve the prompt)
To help with JSON response formatting, you can define a `pydantic.BaseModel` class with the respective fields ('object_type', 'num_objects', 'object_2d_centroids'), populate it with some test data, and convert it to JSON (see the sketch after the links below). If you do so, you can include this response pydantic class along with your dataset class (either as a docstring or as an unused class definition) for self-documentation. Users can also potentially use the pydantic class for constrained/guided decoding:
- https://github.com/oumi-ai/oumi/blob/bfeff245afdd1eddf8f71d569c286f8ed9d1601d/src/oumi/core/configs/params/guided_decoding_params.py#L30 )
- https://github.com/oumi-ai/oumi/blob/bfeff245afdd1eddf8f71d569c286f8ed9d1601d/tests/unit/inference/test_remote_inference_engine.py#L1221
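As a minimal sketch of such a response model (assuming pydantic v2; the class name `CountResponse` and sample values are made up, while the field names follow the suggestion above):

```python
# Hypothetical response schema for pixmo-count examples (illustrative only).
from pydantic import BaseModel


class CountResponse(BaseModel):
    """Structured answer: object type, count, and 2D centroids in [0, 1000]."""

    object_type: str
    num_objects: int
    object_2d_centroids: list[tuple[int, int]]


# Populate with test data and convert to JSON, e.g., to produce <SAMPLE JSON>
# for the prompt or to document the expected output format.
sample = CountResponse(
    object_type="boat",
    num_objects=3,
    object_2d_centroids=[(120, 455), (398, 210), (760, 512)],
)
print(sample.model_dump_json(indent=2))  # pydantic v2 API
```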
4. It's largely up to you. You can start with the easier ones. Feel free to split the work into multiple PRs if needed.
How should I approach making a config file for these datasets? Should I use the MolmoE-1B model as they did in the paper? What parameters should I use?
Model training configs should be separate from the dataset definition, i.e., there is no need to submit a training config that uses the new dataset (we can't possibly have sample configs for all model/dataset permutations). It's enough to test the new dataset offline. You can use any model you want, but it's probably easiest to use one of the pre-tested VLMs defined under https://github.com/oumi-ai/oumi/tree/main/configs/recipes/vision
(Molmo is not supported yet: Molmo integration into transformers is a work in progress, see https://github.com/oumi-ai/oumi/issues/1400)
I've tried to run a few vision-language models on the Pixmo datasets on an A100, using the configs in recipes and just changing the dataset name, but I get the error:
raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.") ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None, filters=None) doesn't have a 'processor_name' key.
The unit and integration tests work fine. The changes can be seen here.
Any suggestions on how to fix this please?
We usually see this error when your custom dataset isn't properly registered in the Oumi REGISTRY. Please check that you added your new dataset to the module initialization script: https://github.com/oumi-ai/oumi/blob/main/src/oumi/datasets/vision_language/__init__.py (see the sketch below).
If that doesn't help, any chance you can share your PR so we can take a look?
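For illustration, the registration typically looks roughly like the sketch below. The class and dataset names are taken from the log output above; the exact decorator, base class, attribute, and method names are assumptions based on the pattern used by the existing vision-language datasets in that module, so treat the details as illustrative rather than definitive.

```python
# src/oumi/datasets/vision_language/pixmo_ask_model_anything.py (illustrative sketch)
from oumi.core.datasets import VisionLanguageSftDataset
from oumi.core.registry import register_dataset


@register_dataset("allenai/pixmo-ask-model-anything")  # name used in training configs
class PixmoAskModelAnythingDataset(VisionLanguageSftDataset):
    default_dataset = "allenai/pixmo-ask-model-anything"

    def transform_conversation(self, example):
        # Convert a raw row (image_url, question, answer) into an Oumi Conversation.
        ...


# src/oumi/datasets/vision_language/__init__.py (illustrative sketch)
# The new class must be imported and exported here so the decorator runs and the
# dataset lands in the registry:
from oumi.datasets.vision_language.pixmo_ask_model_anything import (
    PixmoAskModelAnythingDataset,
)

__all__ = [
    # ... existing dataset classes ...
    "PixmoAskModelAnythingDataset",
]
```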
OK, thanks! I was able to fix the error thanks to your hint. I submitted the PR.
I just merged the PR. Thanks for your contribution @jrwana !