[P1] Compatibility with tooling that expects a HF transformer model
I'm raising this issue because, in terms of "production readiness" (the stated goal), pyreft, designed as a very thoughtful library, will need to work together with tooling that expects a loadable vanilla transformer model. A real-world, reproducible example is loading a pyvene-trained model with https://github.com/outlines-dev/outlines in order to create structured JSON / schema-following outputs.
While the model can be accessed via pyref_model.model, it is not loadable on its own, and in any case one tool would miss the other's functionality when loaded this way. What would be an advisable strategy to integrate with other tooling? May I also suggest that different backend engines (e.g. vllm, ollama, llama.cpp) will need to have interfaces to pyreft. Maybe I'm overlooking some documentation here, but I'm unsure how to proceed.
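For concreteness, a minimal sketch of the mismatch, following the pyreft README as I understand it (model name and config values are illustrative):

```python
import torch, transformers, pyreft

base = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)

reft_config = pyreft.ReftConfig(representations={
    "layer": 15,
    "component": "block_output",
    "intervention": pyreft.LoreftIntervention(
        embed_dim=base.config.hidden_size, low_rank_dimension=4),
})
reft_model = pyreft.get_reft_model(base, reft_config)

# reft_model.model is the underlying HF model, but the intervention parameters
# and hook wiring live in the wrapper: handing reft_model.model (or a saved
# checkpoint of it) to outlines silently drops the ReFT behaviour, while
# handing reft_model itself fails because it is not a plain HF model.
```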
Is merging a pyvene intervention into the base model possible, or is pyvene/pyreft more of an active component that will require code changes in any case?
Hey! So:
- We got similar questions on Twitter about accelerating inference with different backends (vllm, mlx, etc.). Currently, `pyvene` is a major dependency for which no alternative exists: it manages the `torch` hooks that are used to intervene on hidden representations at the token level in `pyreft`. To enable support for non-HF and/or non-`torch` models, we would need to replicate some `pyvene` functionality. We have thought about how to do this simply without needing to port `pyvene` entirely[^thoughts], but it's a long-term software engineering task that we don't immediately have the time/resources/people for. Maybe in the summer, once `pyreft` is known to be stable for a variety of models + tasks, we will invest time into this.
- The LoReFT intervention can't be merged into the base model for two reasons. (1) It is a complex function applied directly to the hidden state, so it operates differently from existing model components (which add to the hidden state via residuals) and can't be folded into them as far as we can tell. (2) It operates only on some tokens, not all, but model weights are the same for every token.
So overall, using LoReFT in a model requires either torch-style hooking functionality or code changes to the model to support token-level interventions.
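To make the hook mechanism concrete, here is a minimal sketch in plain `torch` (not `pyvene`'s actual implementation) of a LoReFT-style edit h <- h + R^T(Wh + b - Rh) applied only at selected token positions:

```python
import torch
import torch.nn as nn

class LoreftStyleIntervention(nn.Module):
    def __init__(self, hidden_size: int, rank: int):
        super().__init__()
        # R: low-rank projection; in LoReFT its rows are kept orthonormal.
        self.R = nn.Parameter(torch.empty(rank, hidden_size))
        nn.init.orthogonal_(self.R)
        self.W = nn.Linear(hidden_size, rank)  # learned map W h + b

    def forward(self, h):
        # h: (..., hidden_size). The edit rewrites h inside a rank-r subspace;
        # it is not a residual add that could be folded into existing weights.
        return h + (self.W(h) - h @ self.R.T) @ self.R

def make_reft_hook(intervention, positions):
    # torch forward hook on a decoder layer: edits only `positions` (reason 2
    # above: the edit is token-selective, model weights are not).
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, positions] = intervention(hidden[:, positions])
        return output
    return hook

# Hypothetical usage: intervene on the first 3 prompt tokens at layer 15.
# handle = model.model.layers[15].register_forward_hook(
#     make_reft_hook(LoreftStyleIntervention(model.config.hidden_size, 4),
#                    torch.tensor([0, 1, 2])))
```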
[^thoughts]: E.g. we could just load `pyvene` for the KV-cache population when processing the prompt, and then use the efficient backend for generation. But in the future, we want to support intervention on decoding steps as well, which is messier.
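For the record, a rough sketch of that idea in plain HF/`torch` (hypothetical helper, not a `pyreft` feature): register the intervention hooks for prefill only, then decode with the plain model so a faster backend could in principle take over after the prompt.

```python
import torch

@torch.no_grad()
def generate_with_prompt_only_interventions(model, tokenizer, prompt,
                                            hooked_modules, max_new_tokens=32):
    # hooked_modules: list of (module, hook_fn) pairs implementing interventions
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    handles = [m.register_forward_hook(fn) for m, fn in hooked_modules]
    out = model(**inputs, use_cache=True)  # prefill: KV cache now holds
    for h in handles:                      # intervened representations
        h.remove()
    past, tok = out.past_key_values, out.logits[:, -1:].argmax(-1)
    generated = [tok]
    for _ in range(max_new_tokens - 1):    # hook-free greedy decoding
        out = model(input_ids=tok, past_key_values=past, use_cache=True)
        past, tok = out.past_key_values, out.logits[:, -1:].argmax(-1)
        generated.append(tok)
    return tokenizer.decode(torch.cat(generated, dim=-1)[0])
```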
Assigning P1 since there is no blocker.
An elegant solution could be providing an `AutoModel` import from `pyreft` that encapsulates the hooks while preserving compatibility with other libraries. Is this possible at a high level? If so, I'd be willing to contribute; my interest here also lies in supporting high-throughput vllm and per-request model switching, both already possible with vllm. It just loads an HF AutoModel in the end.
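To gauge feasibility, roughly what I have in mind (entirely hypothetical API; `load_interventions` is a placeholder and `make_reft_hook` refers to the sketch above):

```python
from transformers import AutoModelForCausalLM

class ReftAutoModelForCausalLM:
    """Hypothetical loader: restores base weights plus interventions."""

    @classmethod
    def from_pretrained(cls, path, **kwargs):
        model = AutoModelForCausalLM.from_pretrained(path, **kwargs)
        # load_interventions is a placeholder for restoring the trained
        # LoReFT parameters and their token-selection config from `path`.
        for layer_idx, (iv, positions) in load_interventions(path).items():
            model.model.layers[layer_idx].register_forward_hook(
                make_reft_hook(iv, positions))
        return model  # still a plain HF model object for downstream tooling
```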