PoC: Inference support using Triton
Describe the problem you're trying to solve
Build a proof of concept (PoC) of a generic inference container that uses Triton as the inference engine and can download and use a ModelKit as efficiently as possible.
Describe the solution you'd like
- Generic Container or Base Container:
  - The solution can be a generic container driven by enough metadata, or a base container that is custom built for each ModelKit.
  - By default, the artifacts from the ModelKit should not be baked into the container. Instead, they should be downloaded by the entrypoint or an init container (see the entrypoint sketch after this list).
- Model Download Options:
  - As an alternative, the model can be baked into the init container.
- Streaming Models:
  - Explore ways to stream models directly into GPU memory when using Triton (one possible avenue is sketched after this list).
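A minimal sketch of what the generic container's entrypoint could look like, assuming the `kit` CLI and `tritonserver` are installed in the image and that `kit unpack` accepts a target-directory flag. The environment variable names `MODELKIT_REF` and `MODEL_REPOSITORY` are placeholders, not existing conventions:

```python
#!/usr/bin/env python3
"""Hypothetical entrypoint for a generic Triton inference container.

Unpacks a ModelKit at startup, then replaces itself with tritonserver.
"""
import os
import subprocess

# Placeholder environment contract; names are illustrative only.
modelkit_ref = os.environ["MODELKIT_REF"]  # e.g. "registry.example.com/models/bert:v1"
model_repo = os.environ.get("MODEL_REPOSITORY", "/models")

# Download the ModelKit's artifacts at startup rather than baking them
# into the image. If selective unpacking is supported, this could be
# narrowed to just the model artifacts.
subprocess.run(
    ["kit", "unpack", modelkit_ref, "--dir", model_repo],
    check=True,
)

# Hand the process over to Triton so it becomes PID 1 and receives
# container signals directly.
os.execvp("tritonserver", ["tritonserver", f"--model-repository={model_repo}"])
```

The same script could run in an init container instead, writing to a shared volume that the Triton container mounts as its model repository.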
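On the streaming point: when Triton runs with `--model-control-mode=explicit`, its model-repository load API can accept model file contents inline, which at least avoids staging files on disk; whether the bytes can be streamed straight into GPU memory would need separate investigation. A hedged sketch using the `tritonclient` package, with the model name, config, and file layout all illustrative:

```python
import tritonclient.http as httpclient

# Assumes tritonserver was started with --model-control-mode=explicit.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Minimal model config; in practice this would come from the ModelKit's
# metadata. The model name and backend below are illustrative.
config = """
{
  "name": "example_model",
  "backend": "onnxruntime",
  "max_batch_size": 8
}
"""

# Read model bytes from wherever the ModelKit was fetched to; with more
# work these bytes could be pulled from the registry without ever
# touching local disk.
with open("model.onnx", "rb") as f:
    model_bytes = f.read()

# `files` maps repository-relative paths (with a "file:" prefix) to raw
# bytes; Triton assembles an in-memory model directory from them.
client.load_model(
    "example_model",
    config=config,
    files={"file:1/model.onnx": model_bytes},
)
```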
Describe alternatives you've considered
- Baking Artifacts into the Container:
  - Considered baking the artifacts directly into the container, but this approach lacks flexibility and can lead to larger container sizes.
- External Model Storage:
  - Using external storage solutions to host the models and mount them at runtime. This adds complexity and potential latency.
- On-Demand Model Fetching:
  - Fetching models on demand during inference requests. This could introduce latency on the first request (a lazy-loading sketch follows this list).
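For reference, the on-demand alternative could be prototyped with Triton's explicit model-control mode: check whether the model is ready before the first inference and fetch it lazily. A sketch, again assuming the `kit` CLI is available and using illustrative names:

```python
import subprocess

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def ensure_loaded(model_name: str, modelkit_ref: str, repo_dir: str = "/models") -> None:
    """Lazily fetch and load a model the first time it is requested.

    Illustrative only: real code would need locking so that concurrent
    first requests don't trigger duplicate downloads.
    """
    if client.is_model_ready(model_name):
        return
    # Fetch the ModelKit into Triton's model repository on first use.
    subprocess.run(["kit", "unpack", modelkit_ref, "--dir", repo_dir], check=True)
    # Requires tritonserver running with --model-control-mode=explicit.
    client.load_model(model_name)

# First call pays the download cost; subsequent calls are no-ops.
ensure_loaded("example_model", "registry.example.com/models/example:v1")
```

This makes the latency trade-off concrete: the first request absorbs the full download and load time, while later requests are unaffected.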
Additional context
- The goal is to achieve efficient and flexible model management within the inference container.
- Consider potential performance implications of different model loading strategies, especially with respect to Triton's capabilities.
- Ensure compatibility with existing KitOps and ModelKit infrastructure and suggest improvements.