Add Arcee model support
Summary
This PR adds support for the Arcee model architecture, laying the groundwork for the upcoming Arcee Foundation Model (AFM) release. Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations.
Model Description
Arcee is architecturally similar to Llama but with the following distinctions:
- ReLU² activation: Uses `x * relu(x)` in MLP layers for improved gradient flow (see the sketch after this list)
- Optimized for efficiency: Designed with training and inference efficiency in mind
- Extended context: Supports extended context with RoPE scaling
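For illustration, here is a minimal sketch of an MLP block with the ReLU² activation. The layer names (`up_proj`, `down_proj`) and the absence of a gate projection are assumptions for this sketch, not necessarily the exact ArceeMLP implementation in the PR:

```python
import torch
import torch.nn as nn


class Relu2MLP(nn.Module):
    """Illustrative MLP block using ReLU² (x * relu(x)) in place of SiLU.

    Layer names and the missing gate projection are assumptions for this
    sketch; the ArceeMLP in the PR may differ.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up_proj(x)
        # ReLU² is equivalent to relu(h) ** 2, since h * relu(h) zeroes negative values
        return self.down_proj(h * torch.relu(h))
```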
Implementation Details
- Modular implementation inheriting from Llama components where applicable (see the sketch after this list)
- Custom ArceeMLP class implementing the ReLU² activation
- Full support for all standard transformers features:
  - Flash Attention 2, SDPA, and other attention backends
  - Gradient checkpointing
  - Quantization support (including quantized caches)
- All standard model variants (CausalLM, SequenceClassification, QuestionAnswering, TokenClassification)
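To give a rough picture of the modular approach mentioned above, the sketch below reuses Llama components and notes where the ReLU² MLP is swapped in. The subclass layout is an assumption for illustration; the PR's actual modular_arcee.py may differ:

```python
# Hedged sketch of the modular-transformers pattern: thin subclasses of the
# Llama building blocks, with only the MLP replaced. Not the exact PR contents.
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaDecoderLayer,
    LlamaForCausalLM,
    LlamaModel,
)


class ArceeAttention(LlamaAttention):
    pass  # unchanged; Flash Attention 2 / SDPA support comes from the base class


class ArceeDecoderLayer(LlamaDecoderLayer):
    pass  # in the real modular file, the MLP is swapped for the ReLU² ArceeMLP


class ArceeModel(LlamaModel):
    pass


class ArceeForCausalLM(LlamaForCausalLM):
    pass
```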
Testing
- Added comprehensive test suite following standard transformers test patterns
- Tests for all model variants and core functionality
- Specific test for ReLU² activation verification (see the sketch after this list)
- RoPE scaling tests including YARN support
- Verified model loading and forward/backward passes
- Confirmed compatibility with existing transformers infrastructure
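As an illustration of the ReLU² and RoPE checks listed above, a unit test could look roughly like the sketch below. The tiny hyperparameters and the yarn rope_scaling dict are assumptions chosen so the test runs quickly; the PR's actual test suite follows the standard transformers tester classes and may be structured differently:

```python
import torch

from transformers import ArceeConfig, ArceeForCausalLM
from transformers.activations import ACT2FN


def test_relu2_activation_and_tiny_forward_backward():
    # The "relu2" activation registered in transformers should equal x * relu(x).
    act = ACT2FN["relu2"]
    x = torch.randn(16)
    torch.testing.assert_close(act(x), x * torch.relu(x))

    # Tiny config so the test runs in milliseconds; these values and the yarn
    # rope_scaling dict are assumptions for this sketch, not the PR's defaults.
    config = ArceeConfig(
        vocab_size=128,
        hidden_size=32,
        intermediate_size=64,
        num_hidden_layers=2,
        num_attention_heads=4,
        num_key_value_heads=4,
        max_position_embeddings=128,
        rope_scaling={"rope_type": "yarn", "factor": 2.0},
    )
    model = ArceeForCausalLM(config)

    # Forward and backward pass on random token ids.
    input_ids = torch.randint(0, config.vocab_size, (1, 8))
    out = model(input_ids=input_ids, labels=input_ids)
    out.loss.backward()
```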
Looks good @Crystalcareai! Feel free to ping us whenever you're ready for review. You can also resolve the code style errors with `pip install -e .[quality]` followed by `make style` or `make fixup`.
@Rocketknight1 Hey, I think I'm ready for a review. I've got a lot of the tests passing, though I'm still getting some failures that don't seem to be related to my code. Let me know how best I can get this ready for merging.
Hi @Cyrilvallez, thanks for the feedback. I made the requested refactoring changes. Also, after removing the init from the modular implementation as suggested, the generated modeling code no longer has `self.config_class = ArceeConfig` from the previous version. Is that redundant as well?
Yes, it's already set in `PreTrainedModel`!
@Cyrilvallez Thanks for the feedback. I removed the pretraining TP from the configurations and added scaffolding for generation integration testing. We will add more robust integration tests and update the checkpoints with the release.
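For reference, the generation scaffolding could look roughly like the sketch below. The checkpoint id is a placeholder (the real one lands with the release) and the assertions are deliberately loose until the expected outputs can be pinned:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.testing_utils import slow


@slow
def test_arcee_generation_scaffold():
    # Placeholder checkpoint id; the release checkpoints will replace this,
    # and the test will then assert on exact generated text.
    checkpoint = "arcee-ai/<release-checkpoint>"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Loose checks only: generation ran and produced new tokens.
    assert output_ids.shape[-1] > inputs["input_ids"].shape[-1]
    assert isinstance(text, str)
```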