Idea: Implement configurations encompassing the huggingface transformers and diffusers frameworks and their underlying back-end options.
"The RamaLama project's goal is to make working with AI boring through the use of OCI containers...."
I'm very impressed by the capability / flexibility offered by ramalama and the compelling opportunity it has to fill a presently large gap in facilitating ML model acquisition / use / sysadmin / security.
I have enjoyed using several of the technologies involved (the 'containers' ecosystem, podman, etc.) and also ML models / frameworks such as llama.cpp, openvino, huggingface transformers, huggingface diffusers, pytorch, OCI, CDI, et al. I think there's great need and opportunity for this kind of synergistic unification of tools (containers and the linux tool suite + model resources + inference framework resources) to facilitate "making it boring" to run ML models as 'applications'. But of course the implementation details are complex enough that the end user's use case (something everybody needs) is hard to do sysadmin / devops / security for, and only a few developers reinvent the wheel to set this all up bespoke. ramalama is on track to solve a lot of those "implementation details" and encapsulate them.
If huggingface transformers were enabled as a base inference framework, it would provide essentially instant (release-day) support for something like 95%+ of open-weights LLM models. This broadens the reach beyond e.g. llama.cpp, ollama, onnx, openvino, since only a small subset of released models is supported by those inference engines (significant independent community / project developer engineering is needed for some / many models). It also provides a faster path for users to run models ASAP, as opposed to possibly having no ramalama / llama.cpp / ollama etc. support for perhaps many months, and then only for the subset of models that are ultimately supported by such non-HF-transformers / pytorch based engines.
Use case examples / documentation for how to use almost all newly released models are almost always provided first, foremost, and often exclusively in the context of "how to use this model with huggingface transformers".
The Python-based HF pipeline etc. for a given model / class of models usually exposes all the major supported use cases for inferencing those models.
That's unlike e.g. llama.cpp and similar runtimes, where it's more common to hit bumpy roads around prompt templates, tokenization, metadata, quantization vs. quality, etc., since non-HF-transformers engines are usually not directly supported by the model makers and the conversions / mappings / derivative support can be complex.
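To illustrate the kind of pipeline-level access described above, here is a minimal sketch (not ramalama-specific; the model id and parameters are placeholders, and `device_map="auto"` assumes the `accelerate` package is installed):

```python
# Minimal sketch of HF transformers pipeline usage; model id is a placeholder.
from transformers import pipeline

# "text-generation" pipelines bundle tokenizer, prompt handling and model
# loading, so newly released models typically work out of the box.
generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    device_map="auto",                   # needs `accelerate`; picks GPU/CPU automatically
)

out = generator("Explain OCI containers in one sentence.", max_new_tokens=64)
print(out[0]["generated_text"])
```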
"pip" or similar installations are able to be used from trustworthy upstream repos / vendors to install the dependencies to run models with HF transformers, acquire back end configurations needed to accelerate inference for nvidia / amd / intel et. al. GPUs. The required packages / modules are commonly available in major linux distribution repos or from official and major upstream vendor sources (vendor's container registry images, vendor's sites / hubs...).
The same benefits described above for running ML models with HF transformers apply, but even more so, to huggingface diffusers and the diffusion (et al.) models it supports. One does not see e.g. llama.cpp, AFAIK, supporting diffusion model inference / serving, and HF diffusers is probably the primary way (besides ONNX or apple / qualcomm / samsung specific options) that open diffusion models can be inferenced.
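For diffusion models, the analogous hedged sketch with HF diffusers looks like this (the model id is a placeholder, and the dtype / device choices assume a CUDA GPU is available):

```python
# Minimal HF diffusers sketch; model id is a placeholder and a CUDA GPU is assumed.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder model id
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe("an isometric illustration of a container ship").images[0]
image.save("out.png")
```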
So it could be a low-effort integration to package / containerize HF transformers + HF diffusers plus the relevant backend components (e.g. pytorch, openvino) to enable CPU / GPU based inference on multiple platforms, while gaining the benefit of ramalama's value-added unified command line tools / OCI / container support, etc.
As mentioned, the openvino model format and inference engine can also be leveraged by the very prominent and broadly relevant / contemporary huggingface transformers / diffusers inference framework projects as one possible back end, via huggingface's 'optimum-intel' project; pytorch support provides an alternative backend (which in turn supports xpu (intel GPUs) directly, cuda / nvidia GPUs, and many others).
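A hedged sketch of that optimum-intel path (assumes `optimum[openvino]` is installed; the model id is a placeholder, and `export=True` converts the HF weights to OpenVINO IR on first load):

```python
# Hedged sketch of running an HF model through the OpenVINO backend via optimum-intel.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model id
model = OVModelForCausalLM.from_pretrained(model_id, export=True)  # convert to OpenVINO IR
tokenizer = AutoTokenizer.from_pretrained(model_id)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Hello", max_new_tokens=32)[0]["generated_text"])
```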
So, as an intersecting tangent (a distinct but overlapping feature), I think it's well worth considering providing some ramalama containerized inference configurations for HF diffusers and HF transformers, along with the ability for those to work with the various backend inference chains / platforms they indirectly support, e.g. optimum-intel, openvino, pytorch, onnx, etc.
https://huggingface.co/docs/optimum/main/en/intel/index
https://huggingface.co/docs/transformers/perf_infer_gpu_one
IME most models (LLMs, diffusion models) released in the past year or more have had very prompt, if not same-day, support for inference using HF transformers / diffusers, very commonly with underlying pytorch based execution. This inference option is usually among the best documented, best supported, and most flexible wrt. configuration of all the options.
Other inference engine / framework based support / configurations for released models may, in a moderately small subset of model types / releases, eventually (often months later) be independently developed / enabled, e.g. in llama.cpp. But in many cases, many model categories and specific releases are historically unlikely to be supported even after 1-2+ years by some inference runtimes such as llama.cpp, onnx, openvino, whereas they usually have excellent "at launch" pytorch / huggingface inference support & documentation.
So for those reasons I can envision it being a great boon in capability, addressing broad / contemporary user use cases for ramalama ("The RamaLama project's goal is to make working with AI boring through the use of OCI containers."), to support inference / serving configurations based on the huggingface / pytorch / openvino framework / engine options, all of which can coexist very well with the OCI / container / podman / command set etc. key enabling aspects of ramalama.
https://huggingface.co/docs/transformers/index
https://huggingface.co/docs/diffusers/index
Originally posted by @ghchris2021 in #607
Seconding this. There are many platforms that will run popular models but none that support transformers & sentence-transformers as a fallback (Ollama claims to support it, but AFAIK only if the model is in a GGUF format which llama.cpp understands). As it's a fallback, it does not even need to be super optimized. This would be invaluable for prototyping and small scale projects.
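For reference, the sentence-transformers "fallback" case is roughly this small (the model id is a placeholder; no ramalama integration is implied):

```python
# Hedged sketch of the sentence-transformers embedding use case.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model id
embeddings = model.encode(["RamaLama makes AI boring.", "OCI containers"])
print(embeddings.shape)  # (2, embedding_dim)
```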
It's certainly something we are interested in. @engelmi was looking into something similar. PRs welcome for this.
@engelmi Any update on this?
@engelmi Any update on this?
Unfortunately, I didn't have time to look at this. So if anyone else wants to take over and/or submit PRs, that would be awesome :)