
Cerebras inference support

Open · neoOpus opened this issue 6 months ago · 1 comment

Hi,

I would like to know whether anyone has worked on making humanify support Cerebras inference. It is OpenAI-compatible and could be a better alternative in terms of speed and cost:

https://inference-docs.cerebras.ai/resources/openai

neoOpus · Jun 29 '25 13:06

> as it is compatible with OpenAI

@neoOpus Have you tried using the humanify openai --baseURL param in the way they suggest?

  • https://inference-docs.cerebras.ai/resources/openai#configuring-openai-to-use-cerebras-api
    • Configuring OpenAI to Use Cerebras API

https://github.com/jehna/humanify/blob/7beba2d32433e58bb77d0e1b0eda01c470fec3e2/src/commands/openai.ts#L20-L24
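
If it works the way it does with other OpenAI-compatible providers, the invocation would presumably look something like the sketch below. The base URL and model ID are taken from the Cerebras docs linked above; the flag names follow humanify's openai command, but I haven't verified this end to end, so treat it as an untested sketch:

```shell
# Hypothetical sketch: point humanify's OpenAI provider at Cerebras'
# OpenAI-compatible endpoint. Unverified; check `humanify openai --help`
# for the exact flag names in your installed version.
export CEREBRAS_API_KEY="csk-..."

humanify openai \
  --baseURL "https://api.cerebras.ai/v1" \
  --apiKey "$CEREBRAS_API_KEY" \
  --model "llama-3.3-70b" \
  obfuscated.js
```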

I'd be interested to hear if you manage to get it to work, and also your feedback on the speed differences, how effective the different models are when used with humanify, etc.


It seems it's also usable via OpenRouter:

  • https://github.com/jehna/humanify/issues/416
    • https://inference-docs.cerebras.ai/resources/openrouter-cerebras
    • https://openrouter.ai/provider/cerebras

These seem to be the models currently available:

  • https://inference-docs.cerebras.ai/introduction
    • The Cerebras Inference API currently provides access to the following models:

      Model Name                       Model ID                         Parameters    Speed (tokens/s)
      Llama 4 Scout                    llama-4-scout-17b-16e-instruct   109 billion   ~2600
      Llama 3.1 8B                     llama3.1-8b                      8 billion     ~2200
      Llama 3.3 70B                    llama-3.3-70b                    70 billion    ~2100
      Qwen 3 32B*                      qwen-3-32b                       32 billion    ~2100
      DeepSeek R1 Distill Llama 70B*   deepseek-r1-distill-llama-70b    70 billion    ~1700

The pricing:

  • https://inference-docs.cerebras.ai/support/pricing
    • Pricing

    • Our free tier supports a context length of 8,192 tokens. For all supported models, we also offer context lengths up to 128K upon request.

    • https://inference-docs.cerebras.ai/support/pricing#exploration-tier-pricing
      • Model                           Speed            Input            Output
        Llama 4 Scout                   ~2600 tokens/s   $0.65/M tokens   $0.85/M tokens
        Llama 3.1 8B                    ~2200 tokens/s   $0.10/M tokens   $0.10/M tokens
        Llama 3.3 70B                   ~2100 tokens/s   $0.85/M tokens   $1.20/M tokens
        Qwen 3 32B                      ~2100 tokens/s   $0.40/M tokens   $0.80/M tokens
        DeepSeek R1 Distill Llama 70B   ~1700 tokens/s   $2.20/M tokens   $2.50/M tokens
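
To put that pricing in perspective, here's a rough back-of-the-envelope cost estimate for a single run. The per-million-token prices are from the table above; the token counts are hypothetical placeholders (actual usage will depend heavily on the size of the bundle being deobfuscated):

```python
# Rough cost estimate against Cerebras' exploration-tier pricing.
# Prices are ($/M input tokens, $/M output tokens) from the table above.
PRICING = {
    "llama3.1-8b":   (0.10, 0.10),
    "llama-3.3-70b": (0.85, 1.20),
    "qwen-3-32b":    (0.40, 0.80),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    in_price, out_price = PRICING[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# e.g. a hypothetical run of ~2M input / ~0.5M output tokens on Llama 3.3 70B:
cost = estimate_cost("llama-3.3-70b", 2_000_000, 500_000)
print(f"${cost:.2f}")  # → $2.30
```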

And the rate limits:

  • https://inference-docs.cerebras.ai/support/rate-limits
    • Rate Limits

And further docs about tool use/function calling:

  • https://inference-docs.cerebras.ai/capabilities/tool-use
    • Tool Use

  • https://inference-docs.cerebras.ai/agent-bootcamp/section-2
    • Tool Use and Function Calling
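
The tool-use docs matter here because humanify's rename step relies on OpenAI-style function calling, so whether Cerebras' implementation accepts that payload shape is exactly what would need testing. For reference, an OpenAI-style request body with a tools schema looks like the following; the rename_variable tool is a simplified stand-in for illustration, not humanify's actual schema:

```python
# Sketch of an OpenAI-style function-calling request body, the kind of payload
# an OpenAI-compatible endpoint like Cerebras' would need to accept.
# The "rename_variable" tool is a made-up example, not humanify's real schema.
import json

request_body = {
    "model": "llama-3.3-70b",
    "messages": [
        {"role": "user", "content": "Suggest a better name for the variable `a`."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "rename_variable",
                "description": "Rename a minified variable to something descriptive.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "new_name": {
                            "type": "string",
                            "description": "The suggested descriptive name.",
                        }
                    },
                    "required": ["new_name"],
                },
            },
        }
    ],
    # Force the model to call the tool rather than reply in free text.
    "tool_choice": {"type": "function", "function": {"name": "rename_variable"}},
}

print(json.dumps(request_body, indent=2))
```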


See Also:

  • https://github.com/jehna/humanify/issues/400
  • https://github.com/jehna/humanify/issues/84

0xdevalias · Jun 30 '25 03:06