vLLM topic
llm-server-docs
Documentation on setting up an LLM server on Debian from scratch, using Ollama/vLLM, Open WebUI, OpenedAI Speech, and ComfyUI.
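Once the pieces from these docs are running, a quick smoke test is to query Ollama's local REST API, which listens on port 11434 by default. A minimal sketch; the model tag "llama3" is a placeholder for whichever model you pulled.

```python
import requests

# Ollama serves a REST API on localhost:11434 by default.
# "llama3" is a placeholder; use any model tag you have pulled locally.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Say hello in one sentence.", "stream": False},
)
print(resp.json()["response"])
```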
LMCache
Supercharge Your LLM with the Fastest KV Cache Layer
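LMCache's core idea is reusing KV caches across requests so shared prompt prefixes are computed once. The idea is visible in miniature with vLLM's built-in prefix caching; this is a hedged sketch of that related mechanism, not LMCache's own connector API, which is configured per the LMCache docs.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets vLLM reuse KV blocks for shared prompt prefixes;
# LMCache generalizes this by persisting and sharing KV caches across requests
# and engines. Model choice here is illustrative.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", enable_prefix_caching=True)

system = "You are a careful assistant. Answer concisely. " * 20  # long shared prefix
prompts = [system + q for q in ("Headache for 3 days.", "Mild fever and cough.")]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```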
guidance-for-scalable-model-inference-and-agentic-ai-on-amazon-eks
Comprehensive, scalable ML inference architecture using Amazon EKS, leveraging Graviton processors for cost-effective CPU-based inference and GPU instances for accelerated inference. Guidance provides...
tiny-llm
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
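The central data structure in any tiny-vLLM-style serving loop is the KV cache: keys and values for past tokens are stored once so each decode step only attends against them. A minimal single-head sketch in NumPy, with all shapes and names invented for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Append-only store of past keys/values for one attention head."""
    def __init__(self):
        self.ks, self.vs = [], []

    def append(self, k, v):
        self.ks.append(k)
        self.vs.append(v)

    def stacked(self):
        return np.stack(self.ks), np.stack(self.vs)  # each (seq_len, d)

def decode_step(q, k, v, cache):
    """Cache this step's k/v, then attend the new query over all positions."""
    cache.append(k, v)
    K, V = cache.stacked()
    weights = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return weights @ V

# Toy decode loop: four steps with random projections of dimension 8.
d, cache = 8, KVCache()
rng = np.random.default_rng(0)
for _ in range(4):
    q, k, v = rng.normal(size=(3, d))
    out = decode_step(q, k, v, cache)
print(out.shape)  # (8,)
```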
Chinese-MedQA-Qwen2
A medical question-answering system based on Qwen2 with SFT and DPO; the project uses LLaMA-Factory for training and fastllm/vLLM for inference.
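For the vLLM inference side, offline batch generation uses vLLM's standard `LLM`/`SamplingParams` API. A minimal sketch; the model shown is the Qwen2 instruct base, where the project would substitute its fine-tuned medical checkpoint.

```python
from vllm import LLM, SamplingParams

# Placeholder model: the project's SFT+DPO medical checkpoint would go here.
llm = LLM(model="Qwen/Qwen2-7B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(
    ["What are common symptoms of iron-deficiency anemia?"], params
)
print(outputs[0].outputs[0].text)
```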
gpt_server
gpt_server is an open-source framework for production-grade deployment of LLMs, embeddings, rerankers, ASR, TTS, text-to-image, image editing, and text-to-video.
easy-model-deployer
Deploy open-source LLMs on AWS in minutes — with OpenAI-compatible APIs and a powerful CLI/SDK toolkit.
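Because the deployed endpoint is OpenAI-compatible, any standard OpenAI client can talk to it. A hedged sketch; the base URL and model ID are assumptions to be replaced with the values your deployment reports.

```python
from openai import OpenAI

# Base URL and model ID are placeholders; substitute the endpoint and model
# name that your deployment reports after it comes up.
client = OpenAI(base_url="http://<your-endpoint>/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(resp.choices[0].message.content)
```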
Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
gpustack
GPU cluster manager for optimized AI model deployment
GPTQModel
LLM quantization (compression) toolkit with hardware-acceleration support for NVIDIA CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.
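The typical GPTQ workflow is load, calibrate, quantize, save. A minimal sketch following the flow in GPTQModel's README; exact argument names may vary across versions, and the calibration texts here are placeholders (real runs use a few hundred representative samples).

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder calibration set; GPTQ fits quantization scales against it.
calibration = [
    "GPTQ calibrates quantization scales on a small sample of text.",
    "A few hundred representative examples are typical in practice.",
]

cfg = QuantizeConfig(bits=4, group_size=128)           # 4-bit weights, per-group scales
model = GPTQModel.load("Qwen/Qwen2-7B-Instruct", cfg)  # load FP16 weights + quant config
model.quantize(calibration)                            # run GPTQ over calibration data
model.save("qwen2-7b-instruct-gptq-4bit")              # emit the quantized checkpoint
```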