Add Arcee model support
Summary
This PR adds support for the Arcee model architecture, laying the groundwork for the upcoming Arcee Foundation Model (AFM) release. Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations.
Model Description
Arcee is architecturally similar to Llama but with the following distinctions:
- ReLU² activation: Uses `x * relu(x)` in MLP layers for improved gradient flow (see the sketch after this list)
- Optimized for efficiency: Designed with training and inference efficiency in mind
- Extended context: Supports extended context with RoPE scaling
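For illustration, here is a minimal sketch of an MLP block with the ReLU² activation. The layer names (`up_proj`, `down_proj`) and the absence of a gate projection are assumptions for this sketch, not necessarily the exact ArceeMLP implementation in the PR:

```python
import torch
import torch.nn as nn


class Relu2MLP(nn.Module):
    """Illustrative MLP block using ReLU² (x * relu(x)) in place of SiLU.

    Layer names and the missing gate projection are assumptions for this
    sketch; the ArceeMLP in the PR may differ.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up_proj(x)
        # ReLU² is equivalent to relu(h) ** 2, since h * relu(h) zeroes negative values
        return self.down_proj(h * torch.relu(h))
```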
Implementation Details
- Modular implementation inheriting from Llama components where applicable (see the sketch after this list)
- Custom ArceeMLP class implementing the ReLU² activation
- Full support for all standard transformers features:
  - Flash Attention 2, SDPA, and other attention backends
  - Gradient checkpointing
  - Quantization support (including quantized caches)
- All standard model variants (CausalLM, SequenceClassification, QuestionAnswering, TokenClassification)
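To give a rough picture of the modular approach mentioned above, the sketch below reuses Llama components and notes where the ReLU² MLP is swapped in. The subclass layout is an assumption for illustration; the PR's actual modular_arcee.py may differ:

```python
# Hedged sketch of the modular-transformers pattern: thin subclasses of the
# Llama building blocks, with only the MLP replaced. Not the exact PR contents.
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaDecoderLayer,
    LlamaForCausalLM,
    LlamaModel,
)


class ArceeAttention(LlamaAttention):
    pass  # unchanged; Flash Attention 2 / SDPA support comes from the base class


class ArceeDecoderLayer(LlamaDecoderLayer):
    pass  # in the real modular file, the MLP is swapped for the ReLU² ArceeMLP


class ArceeModel(LlamaModel):
    pass


class ArceeForCausalLM(LlamaForCausalLM):
    pass
```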
Testing
- Added comprehensive test suite following standard transformers test patterns
- Tests for all model variants and core functionality
- Specific test for ReLU² activation verification (see the sketch after this list)
- RoPE scaling tests including YARN support
- Verified model loading and forward/backward passes
- Confirmed compatibility with existing transformers infrastructure
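As an illustration of the ReLU² and RoPE checks listed above, a unit test could look roughly like the sketch below. The tiny hyperparameters and the yarn rope_scaling dict are assumptions chosen so the test runs quickly; the PR's actual test suite follows the standard transformers tester classes and may be structured differently:

```python
import torch

from transformers import ArceeConfig, ArceeForCausalLM
from transformers.activations import ACT2FN


def test_relu2_activation_and_tiny_forward_backward():
    # The "relu2" activation registered in transformers should equal x * relu(x).
    act = ACT2FN["relu2"]
    x = torch.randn(16)
    torch.testing.assert_close(act(x), x * torch.relu(x))

    # Tiny config so the test runs in milliseconds; these values and the yarn
    # rope_scaling dict are assumptions for this sketch, not the PR's defaults.
    config = ArceeConfig(
        vocab_size=128,
        hidden_size=32,
        intermediate_size=64,
        num_hidden_layers=2,
        num_attention_heads=4,
        num_key_value_heads=4,
        max_position_embeddings=128,
        rope_scaling={"rope_type": "yarn", "factor": 2.0},
    )
    model = ArceeForCausalLM(config)

    # Forward and backward pass on random token ids.
    input_ids = torch.randint(0, config.vocab_size, (1, 8))
    out = model(input_ids=input_ids, labels=input_ids)
    out.loss.backward()
```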
Looks good @Crystalcareai! Feel free to ping us whenever you're ready for review. You can also resolve the code style errors with `pip install -e .[quality]` followed by `make style` or `make fixup`.
@Rocketknight1 Hey, I think I'm ready for a review. I've got a lot of the tests passing, though I'm still getting some failures that don't seem to be related to my code. Let me know how best I can get this ready for merging.
Hi @Cyrilvallez, thanks for the feedback. I made the requested refactoring changes. Also, after removing the init from the modular implementation as suggested, the generated modeling code no longer has `self.config_class = ArceeConfig` from the previous version. Is that redundant as well?
Yes, it's already set in `PreTrainedModel`!
@Cyrilvallez Thanks for the feedback. I removed the pretraining TP from the configurations and added scaffolding for generation integration testing. We will add more robust integration tests and update the checkpoints with the release.
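For reference, the generation scaffolding could look roughly like the sketch below. The checkpoint id is a placeholder (the real one lands with the release) and the assertions are deliberately loose until the expected outputs can be pinned:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.testing_utils import slow


@slow
def test_arcee_generation_scaffold():
    # Placeholder checkpoint id; the release checkpoints will replace this,
    # and the test will then assert on exact generated text.
    checkpoint = "arcee-ai/<release-checkpoint>"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Loose checks only: generation ran and produced new tokens.
    assert output_ids.shape[-1] > inputs["input_ids"].shape[-1]
    assert isinstance(text, str)
```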