PerceptionLM

Open shuminghu opened this pull request 7 months ago • 5 comments

This PR implements PerceptionLM released by Meta: https://github.com/facebookresearch/perception_models

Results of pytest tests/models/perception_lm/*.py are below: 16 failed / 151 passed. The failing tests can be roughly grouped into these categories:

  1. My test env doesn't have full internet access ('Connection aborted.').
  2. PE from timm doesn't work with meta tensors.
  3. Weight init: e.g., PE weights are not initialized through _init_weights.
  4. Explicitly disabled behavior: providing both pixel_values and inputs_embeds to the model.
  5. NotImplementedError.
======================================================================================= short test summary info ========================================================================================
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_can_be_initialized_on_meta - RuntimeError: Tensor.item() cannot be called on meta tensors
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_can_init_all_missing_weights - AssertionError: False is not true : The following keys are not properly handled by `_init_weights()`:
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_can_load_with_meta_device_context_manager - RuntimeError: Tensor.item() cannot be called on meta tensors
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_generate_from_inputs_embeds_0_greedy - ValueError: You cannot specify both pixel_values and inputs_embeds at the same time, and must specify either one
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_generate_from_inputs_embeds_1_beam_search - ValueError: You cannot specify both pixel_values and inputs_embeds at the same time, and must specify either one
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_generate_from_inputs_embeds_with_static_cache - ValueError: You cannot specify both pixel_values and inputs_embeds at the same time, and must specify either one
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_initialization - AssertionError: -0.00014135599485598505 not found in [0.0, 1.0] : Parameter model.vision_tower.eva_pe.cls_token of model <class 'transformers.models.perception_lm.modeling_perception_lm.Perceptio...
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_model_get_set_embeddings - NotImplementedError
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_model_outputs_equivalence - AttributeError: 'tuple' object has no attribute 'to_tuple'
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_resize_embeddings_untied - NotImplementedError
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_resize_tokens_embeddings - NotImplementedError
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_sdpa_can_dispatch_composite_models - IndexError: list index out of range
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_tie_model_weights - NotImplementedError
FAILED tests/models/perception_lm/test_processor_perception_lm.py::PerceptionLMProcessorTest::test_apply_chat_template_video_0 - requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
FAILED tests/models/perception_lm/test_processor_perception_lm.py::PerceptionLMProcessorTest::test_apply_chat_template_video_1 - requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
FAILED tests/models/perception_lm/test_processor_perception_lm.py::PerceptionLMProcessorTest::test_apply_chat_template_video_frame_sampling - requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
================================================================== 16 failed, 151 passed, 70 skipped, 3 warnings in 75.47s (0:01:15) ===================================================================
NCCL version 2.26.2+cuda12.2

shuminghu avatar Apr 29 '25 23:04 shuminghu

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

github-actions[bot] avatar Apr 29 '25 23:04 github-actions[bot]

Yay, super excited to get the model shipped! I know it is early to review, but I noticed the model doesn't have a modular file yet. I recommend using modular transformers to add the model: it lets you inherit from any similar model in transformers, so you won't have to rewrite the whole class.

It also makes the review process easier and faster, since we can see the main differences between PE and other existing models 😉
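
For illustration, a minimal sketch of what such a modular file could look like, assuming PerceptionLM can start from Llava and only override what actually differs (vision tower, multimodal projector, processing). The class names and the direct inheritance from the Llava classes are assumptions, not the final PR code; the modular converter script in utils/ would then generate the full configuration and modeling files from it:

# modular_perception_lm.py -- hypothetical sketch
from transformers.models.llava.configuration_llava import LlavaConfig
from transformers.models.llava.modeling_llava import LlavaForConditionalGeneration


class PerceptionLMConfig(LlavaConfig):
    model_type = "perception_lm"


class PerceptionLMForConditionalGeneration(LlavaForConditionalGeneration):
    # The real implementation would swap in the Perception Encoder vision tower
    # and its projector; everything else is inherited from Llava.
    pass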

I see. Let me take a look. The classes here were added automatically (first commit: plm template) by running this command:

transformers-cli add-new-model-like
What is the model you would like to duplicate? Please provide the lowercase `model_type` (e.g. roberta): llava
What is the name (with no special casing) for your new model in the paper (e.g. RoBERTa)? PerceptionLM
What identifier would you like to use for the `model_type` of this model?  [perceptionlm] perception_lm
What lowercase name would you like to use for the module (folder) of this model?  [perceptionlm] perception_lm
What prefix (camel-cased) would you like to use for the model classes of this model (e.g. Roberta)?  [PerceptionLM] 
What prefix (upper-cased) would you like to use for the constants relative to this model?  [PERCEPTIONLM] PERCEPTION_LM
What will be the name of the config class for this model?  [PerceptionLMConfig] 
Please give a checkpoint identifier (on the model Hub) for this new model (e.g. facebook/FacebookAI/roberta-base): facebook/Perception-LM-1B
Will your new model use the same processing class as llava (LlamaTokenizer, LlavaProcessor) (yes/no)? no
What will be the name of the tokenizer class for this model?  [PerceptionLMTokenizer] 
What will be the name of the processor class for this model?  [PerceptionLMProcessor] 
Should we add # Copied from statements when creating the new modeling file (yes/no)?  [yes] 
Should we add a version of your new model in all the frameworks implemented by llava (['pt']) (yes/no)?  [yes] 
The constants at the start of the new tokenizer file created needs to be manually fixed. If your new model has a tokenizer fast, you will also need to manually add the converter in the `SLOW_TO_FAST_CONVERTERS` constant of `convert_slow_tokenizer.py`.

shuminghu avatar Apr 30 '25 17:04 shuminghu

Yeah, that way is correct and usually copies an existing similar model. In this case Llava was written without modular, as it was almost the first VLM in transformers :)

zucchini-nlp avatar May 01 '25 10:05 zucchini-nlp

@shuminghu ping me when it is ready for review, and can you mark the PR as "ready for review" as well?

zucchini-nlp avatar May 26 '25 07:05 zucchini-nlp

Yes. Absolutely. Circling back to this now.

shuminghu avatar May 27 '25 21:05 shuminghu

For the failing integration tests where a timm model isn't found, cc @ydshieh. Which version of timm do we have in the runners, and can we update it? We'll also have another big release soon based on timm as a backbone.

Hi, it doesn't seem to be just integration tests but all tests, if we are talking about

RuntimeError: Unknown model (vit_pe_core_large_patch14_336)

In the runner, we have

timm==1.0.15

(this info can be found in the job step Show installed libraries and their versions)

However, we cannot use a model like vit_pe_core_large_patch14_336 in non-slow tests (for CircleCI, for example). Usually with HF models, we set a config with very small values for some attributes, and use that config to create a very tiny model on the fly.

For integration tests, we can use it.
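
For example, the usual pattern looks like the sketch below (LlamaConfig is only an illustrative stand-in here; the real PerceptionLM model tester would define its own tiny config):

from transformers import LlamaConfig, LlamaForCausalLM

# Tiny values so the model can be built and run on CPU within a fast (non-slow) CI test.
tiny_config = LlamaConfig(
    vocab_size=99,
    hidden_size=16,
    intermediate_size=37,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
)
tiny_model = LlamaForCausalLM(tiny_config)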

ydshieh avatar Jun 17 '25 09:06 ydshieh

Right, we need to modify the non-integration tests @shuminghu. And for the integration ones, do we need to add a decorator for the timm version so it doesn't flood nightlies with failures, or do we update timm in the runners?
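
If we go the decorator route, a sketch of what it could look like (the helper name and version check are assumptions, not an existing transformers utility):

import unittest
from importlib.metadata import version

from packaging.version import Version


def require_timm_min_version(min_version):
    # Skip the decorated test or test class when the installed timm is too old,
    # so nightly runs report skips instead of failures.
    try:
        ok = Version(version("timm")) >= Version(min_version)
    except Exception:
        ok = False
    return unittest.skipUnless(ok, f"test requires timm >= {min_version}")

It would then be applied to the PE-dependent integration tests with whatever minimum release turns out to be required.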

zucchini-nlp avatar Jun 17 '25 09:06 zucchini-nlp

vit_pe_core_large_patch14_336 here is just an architecture template. The actual number of layers and dims in the non-integration tests are small numbers, not ViT-L scale.

shuminghu avatar Jun 17 '25 15:06 shuminghu

@ydshieh vit_pe_core_large_patch14_336 is actually not used in the non-integration tests despite the naming. With the latest TimmWrapperConfig support, it is specified as follows in the non-integration tests, where the architecture is merely a template and the model dim and depth are specified via model_args. But yeah, all of this requires timm source that is up to date as of a week ago.

        vision_config={
            "architecture": "vit_pe_core_large_patch14_336",
            "model_args": {
                "embed_dim": 64,
                "img_size": (14, 14),
                "depth": 2,
                "global_pool": "",
                "use_post_transformer_norm": False,
                "init_values": 0.1,
                "ref_feat_shape": (1, 1),
            },
        },
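
Roughly, what this amounts to under the hood is something like the sketch below (not the exact transformers call, and it needs a timm release that includes the PE models): the architecture name only selects the template, and the tiny model_args override the ViT-L defaults.

import timm

tiny_pe = timm.create_model(
    "vit_pe_core_large_patch14_336",
    pretrained=False,
    embed_dim=64,
    depth=2,
    img_size=(14, 14),
    global_pool="",
    use_post_transformer_norm=False,
    init_values=0.1,
    ref_feat_shape=(1, 1),
)
print(sum(p.numel() for p in tiny_pe.parameters()))  # orders of magnitude below ViT-L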

shuminghu avatar Jun 26 '25 05:06 shuminghu

Happy to upgrade timm in docker if @zucchini-nlp and @qubvel confirm it is necessary.

Does this mean we also have to pin the timm version in setup.py so users will have the correct timm version?

(And is this requirement specific to this new model, or do we need it in general anyway?)

ydshieh avatar Jun 27 '25 13:06 ydshieh

We also need the latest timm for gemma3n, because mobilenetv5 was added just recently. But I'm not sure we should pin the reqs strictly to the latest version; maybe just add a version check with a correct error message.
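
A sketch of that version-check option (the exact minimum version, wording, and placement are assumptions):

from importlib.metadata import version

from packaging.version import Version

MIN_TIMM_VERSION = "1.0.16"  # assumed minimum; adjust to whichever release actually contains the needed models


def check_timm_version():
    # Raise a clear error instead of timm's "Unknown model" failure deep in model creation.
    installed = Version(version("timm"))
    if installed < Version(MIN_TIMM_VERSION):
        raise ImportError(
            f"This model requires timm >= {MIN_TIMM_VERSION} (found {installed}). "
            "Please run `pip install -U timm`."
        )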

qubvel avatar Jun 27 '25 13:06 qubvel

We have timm==1.0.15 at this moment, and v1.0.16 was released 16 hours ago.

@shuminghu Are the failing tests caused by timm==1.0.15...?

ydshieh avatar Jun 27 '25 13:06 ydshieh

@ydshieh Right. v1.0.16 would be good for me for CI.

From Release v1.0.16 change log: ... Add EVA ViT based PE (Perceptual Encoder) impl by @rwightman in https://github.com/huggingface/pytorch-image-models/pull/2487

--- update --- Thanks @ydshieh! I just saw CI is passing, so it must have been updated.
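
For anyone checking locally, a quick way to confirm which timm is installed and whether it exposes the PE architectures (the glob pattern is just an assumption based on the model name):

import timm

print(timm.__version__)
print(timm.list_models("*vit_pe_core*"))  # empty list on releases that predate the PE models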

shuminghu avatar Jun 27 '25 16:06 shuminghu

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thanks @ydshieh! I just saw CI is passing. So it must have been updated.

No, not me. We have "timm<=1.0.11" in setup.py, and yet

RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu

in the Dockerfile, and the docker image is built on a daily basis. That is why it works now 😅

Glad CI is ✅ now!

ydshieh avatar Jun 30 '25 13:06 ydshieh

base_model_prefix="model" fixed it! :)

Quoting the review comment in utils/check_repo.py (https://github.com/huggingface/transformers/pull/37878#discussion_r2180485383) on the added "PerceptionLMModel" entry:

Weird, it should be loadable and it'll allow us native integration with vLLM. Probably the base model prefix isn't defined, because in other VLMs it was removed on purpose. Having a base_model_prefix="model" should make it load with both classes.
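
For reference, a sketch of what the fix looks like (class name and placement are illustrative, not the exact PR code):

from transformers import PreTrainedModel


class PerceptionLMPreTrainedModel(PreTrainedModel):
    # With base_model_prefix = "model", from_pretrained can map checkpoint keys
    # between the bare PerceptionLMModel and PerceptionLMForConditionalGeneration,
    # so a checkpoint saved from one class also loads into the other.
    base_model_prefix = "model"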

shuminghu avatar Jul 02 '25 16:07 shuminghu

run-slow: perception_lm

Cyrilvallez avatar Jul 08 '25 13:07 Cyrilvallez

This comment contains run-slow, running the specified jobs:

models: ['models/perception_lm'] quantizations: [] ...

github-actions[bot] avatar Jul 08 '25 13:07 github-actions[bot]

run-slow: perception_lm

Cyrilvallez avatar Jul 11 '25 08:07 Cyrilvallez

This comment contains run-slow, running the specified jobs:

models: ['models/perception_lm'] quantizations: [] ...

github-actions[bot] avatar Jul 11 '25 08:07 github-actions[bot]

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, perception_lm

github-actions[bot] avatar Jul 11 '25 08:07 github-actions[bot]