PoC: Inference support using Triton
Describe the problem you're trying to solve
Build a proof of concept (PoC) of a generic inference container that uses Triton as the inference engine and can download and use a ModelKit as efficiently as possible.
Describe the solution you'd like
- Generic Container or Base Container:
  - The solution can be a generic container driven by enough metadata, or a base container that is custom built for each ModelKit.
  - By default, the artifacts from the ModelKit should not be baked into the container. Instead, they should be downloaded by the entrypoint or an init container (see the entrypoint sketch after this list).
- Model Download Options:
  - As an alternative, the model can be baked into the init container.
- Streaming Models:
  - Explore ways to stream models directly into GPU memory when using Triton (one possible avenue is sketched after this list).
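A minimal sketch of what the generic container's entrypoint could look like, assuming the `kit` CLI and `tritonserver` are installed in the image and that `kit unpack` accepts a target-directory flag. The environment variable names `MODELKIT_REF` and `MODEL_REPOSITORY` are placeholders, not existing conventions:

```python
#!/usr/bin/env python3
"""Hypothetical entrypoint for a generic Triton inference container.

Unpacks a ModelKit at startup, then replaces itself with tritonserver.
"""
import os
import subprocess

# Placeholder environment contract; names are illustrative only.
modelkit_ref = os.environ["MODELKIT_REF"]  # e.g. "registry.example.com/models/bert:v1"
model_repo = os.environ.get("MODEL_REPOSITORY", "/models")

# Download the ModelKit's artifacts at startup rather than baking them
# into the image. If selective unpacking is supported, this could be
# narrowed to just the model artifacts.
subprocess.run(
    ["kit", "unpack", modelkit_ref, "--dir", model_repo],
    check=True,
)

# Hand the process over to Triton so it becomes PID 1 and receives
# container signals directly.
os.execvp("tritonserver", ["tritonserver", f"--model-repository={model_repo}"])
```

The same script could run in an init container instead, writing to a shared volume that the Triton container mounts as its model repository.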
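On the streaming point: when Triton runs with `--model-control-mode=explicit`, its model-repository load API can accept model file contents inline, which at least avoids staging files on disk; whether the bytes can be streamed straight into GPU memory would need separate investigation. A hedged sketch using the `tritonclient` package, with the model name, config, and file layout all illustrative:

```python
import tritonclient.http as httpclient

# Assumes tritonserver was started with --model-control-mode=explicit.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Minimal model config; in practice this would come from the ModelKit's
# metadata. The model name and backend below are illustrative.
config = """
{
  "name": "example_model",
  "backend": "onnxruntime",
  "max_batch_size": 8
}
"""

# Read model bytes from wherever the ModelKit was fetched to; with more
# work these bytes could be pulled from the registry without ever
# touching local disk.
with open("model.onnx", "rb") as f:
    model_bytes = f.read()

# `files` maps repository-relative paths (with a "file:" prefix) to raw
# bytes; Triton assembles an in-memory model directory from them.
client.load_model(
    "example_model",
    config=config,
    files={"file:1/model.onnx": model_bytes},
)
```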
Describe alternatives you've considered
- Baking Artifacts into the Container:
  - Considered baking the artifacts directly into the container, but this approach lacks flexibility and can lead to larger container sizes.
- External Model Storage:
  - Using external storage solutions to host the models and mount them at runtime. This adds complexity and potential latency.
- On-Demand Model Fetching:
  - Fetching models on demand during inference requests. This could introduce latency on the first request (a lazy-loading sketch follows this list).
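For reference, the on-demand alternative could be prototyped with Triton's explicit model-control mode: check whether the model is ready before the first inference and fetch it lazily. A sketch, again assuming the `kit` CLI is available and using illustrative names:

```python
import subprocess

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def ensure_loaded(model_name: str, modelkit_ref: str, repo_dir: str = "/models") -> None:
    """Lazily fetch and load a model the first time it is requested.

    Illustrative only: real code would need locking so that concurrent
    first requests don't trigger duplicate downloads.
    """
    if client.is_model_ready(model_name):
        return
    # Fetch the ModelKit into Triton's model repository on first use.
    subprocess.run(["kit", "unpack", modelkit_ref, "--dir", repo_dir], check=True)
    # Requires tritonserver running with --model-control-mode=explicit.
    client.load_model(model_name)

# First call pays the download cost; subsequent calls are no-ops.
ensure_loaded("example_model", "registry.example.com/models/example:v1")
```

This makes the latency trade-off concrete: the first request absorbs the full download and load time, while later requests are unaffected.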
Additional context
- The goal is to achieve efficient and flexible model management within the inference container.
- Consider potential performance implications of different model loading strategies, especially with respect to Triton's capabilities.
- Ensure compatibility with existing KitOps and ModelKit infrastructure and suggest improvements.