seldon-inference-pipelines
Examples of inference pipelines implemented using https://github.com/SeldonIO/seldon-core
Description
This repo contains a set of practice inference graphs implemented with Seldon Core. The pipelines in the seldon folder are implemented using Seldon's 1st gen custom Python package, while the pipelines in the mlserver folder are implemented using Seldon's newer serving platform MLServer (Serving Custom Models) together with the Seldon Inference Graph.
NOTE: This repo is shared for learning purposes; some of the pipelines implemented here might not have real-world use cases, and they are not fully tested.
Pull requests, suggestions, and additions to the list of pipelines for future implementation are highly appreciated.
Inference graphs implemented using 1st gen Seldon
Pipelines from InferLine: latency-aware provisioning and scaling for prediction serving pipelines
- Cascade
- Ensemble
- Preprocess
- Video Monitoring

and the following pipelines:

- audio-qa: Audio to text -> Question Answering
- audio-sent: Audio to text -> Sentiment Analysis
- nlp: Language Identification -> Translate French to English -> Summarisation
- sum-qa: Summarisation -> Question Answering
- video: Object Detection -> Object Classification
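
For context, each node in these graphs is a model class served by Seldon's 1st gen Python wrapper: a plain class exposing a predict method that the graph orchestrator calls, forwarding its output to the next node. Below is a minimal sketch; the class name and pass-through logic are placeholders, not one of the actual pipeline steps.

```python
# Sketch of a 1st gen Seldon Python model wrapper. A container built around
# this class is served with `seldon-core-microservice` and chained to other
# nodes via a SeldonDeployment graph spec.

class AudioToText:
    def __init__(self):
        # Load the underlying model once at container startup
        # (e.g. a speech-to-text model; omitted in this sketch).
        self.model = None

    def predict(self, X, features_names=None):
        # X is the decoded request payload; whatever is returned here
        # becomes the input of the next node in the inference graph.
        return X
```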
Inference graphs implemented using MLServer
- audio-qa: Audio to text -> Question Answering
- audio-sent: Audio to text -> Sentiment Analysis
- nlp: Language Identification -> Translate French to English -> Summarisation
- sum-qa: Summarisation -> Question Answering
- video: Object Detection -> Object Classification
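
For comparison, an MLServer pipeline step is a custom runtime: a subclass of mlserver.MLModel with async load and predict hooks speaking the V2 inference protocol. A minimal sketch follows; the class name and echo logic are illustrative, not one of the actual steps in this repo.

```python
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class SentimentStep(MLModel):
    """One pipeline step, exposed by MLServer over REST/gRPC (V2 protocol)."""

    async def load(self) -> bool:
        # Load model weights once at startup (omitted in this sketch).
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Placeholder logic: echo the first input tensor back as the output.
        data = list(payload.inputs[0].data)
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="sentiment",
                    shape=[len(data)],
                    datatype="BYTES",
                    data=data,
                )
            ],
        )
```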
DockerHub
Pre-built container images are also available here. So if you just want to try the pipelines out, you can deploy the YAML files on your Kubernetes cluster as they are.
Relevant Projects
Some academic and industrial projects that could be used as a source of inference pipelines for future implementations.
Systems-related Academic Papers
- InferLine: latency-aware provisioning and scaling for prediction serving pipelines
- GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks
- FA2: Fast, Accurate Autoscaling for Serving Deep Learning Inference with SLA Guarantees
- Rim: Offloading Inference to the Edge
- Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
- Scrooge: A Cost-Effective Deep Learning Inference System
- Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis
- VideoEdge: Processing Camera Streams using Hierarchical Clusters
- Live Video Analytics at Scale with Approximation and Delay-Tolerance
- Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse
ML Theory-related Academic Papers
- On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems
- Fixes That Fail: Self-Defeating Improvements in Machine-Learning Systems
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- PaLM: Scaling Language Modeling with Pathways
- Language Model Cascades
Software Engineering-related Academic Papers
- Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems
- PromptChainer: Chaining Large Language Model Prompts through Visual Programming
- 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows
- Feature Interactions on Steroids: On the Composition of ML Models
Industrial Projects
Load Tester
This repo also includes a small async load tester for sending workloads to the models/pipelines. You can find it under the async load tester folder.
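To give an idea of the approach, here is a minimal sketch of an async load generator built on asyncio and aiohttp; the endpoint URL, payload, and parameter values are placeholders, not the repo's actual configuration.

```python
import asyncio
import time

import aiohttp

# Placeholders: point these at the Seldon/MLServer service you deployed.
URL = "http://localhost:8080/v2/models/pipeline/infer"
PAYLOAD = {
    "inputs": [
        {"name": "text", "shape": [1], "datatype": "BYTES", "data": ["hello"]}
    ]
}


async def one_request(session: aiohttp.ClientSession) -> float:
    # Fire a single inference request and return its latency in seconds.
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.read()
    return time.perf_counter() - start


async def main(total: int = 100, concurrency: int = 10) -> None:
    # Bound in-flight requests with a semaphore to model a fixed client pool.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(session: aiohttp.ClientSession) -> float:
        async with sem:
            return await one_request(session)

    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(*[bounded(session) for _ in range(total)])
    print(f"mean latency: {sum(latencies) / len(latencies):.3f}s")


if __name__ == "__main__":
    asyncio.run(main())
```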
Sources of Models
Audio and Text Models
Source:
Image Models
Source:
Please give this repo a star if it helped you learn something new :)
TODOs (sorted by priority)
- Re-implement pipelines in Seldon V2
- Add an example of using shared models in pipelines using V2
- Example of multi-model request propagation
- Example implementation using Nvidia Triton Server as the base containers instead of MLServer
- Examples of model load/unload in Triton and MLServer
- GPU examples with fractional GPUs
- Send image/audio/text in a compressed format
- Add performance evaluation scripts and load tester
- Complete unfinished pipelines
- Examples of using the Triton Client to interact with MLServer examples
- Examples of using Triton Inference Server as the serving backend
- Pipeline implementations in the upcoming Seldon Core V2
- Examples of integration with autoscalers (built-in autoscaler, VPA, and event-driven autoscalers like KEDA)
- Implement a GPT-2 -> DALL-E pipeline inspired by dalle-runtime