
PoC: Inference support using Triton

Open • gorkem opened this issue 8 months ago • 1 comment

Describe the problem you're trying to solve

Build a proof of concept (PoC) of a generic inference container that uses Triton as the inference engine and can download and use a ModelKit as efficiently as possible.

Describe the solution you'd like

  • Generic Container or Base Container:

    • The solution could be either a generic container driven by metadata or a base container that is custom-built for a specific ModelKit.
    • By default, the ModelKit's artifacts should not be baked into the container. Instead, they should be downloaded by the entrypoint or by an init container.
  • Model Download Options:

    • As an alternative, the model can be baked into the init container rather than into the serving container itself.
  • Streaming Models:

    • Explore ways to stream models directly into GPU memory when using Triton.
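The entrypoint-download approach could be sketched as a Dockerfile on top of a Triton base image. This is only an illustration: the Triton image tag, the kit CLI download URL, and the `--model` filter flag on `kit unpack` are assumptions, not a tested recipe.

```dockerfile
# Sketch: generic inference image; the ModelKit is downloaded at
# startup rather than baked into the image.
FROM nvcr.io/nvidia/tritonserver:24.05-py3

# Install the kit CLI (download URL and packaging are illustrative)
RUN curl -fsSL https://github.com/jozu-ai/kitops/releases/latest/download/kitops-linux-x86_64.tar.gz \
    | tar -xz -C /usr/local/bin kit

# MODELKIT_REF is supplied at run time, e.g. jozu.ml/myorg/mymodel:latest
ENV MODELKIT_REF=""
ENV MODEL_REPO=/models

# Entrypoint: unpack only the model artifacts, then start Triton
ENTRYPOINT ["/bin/sh", "-c", \
  "kit unpack \"$MODELKIT_REF\" --model -d \"$MODEL_REPO\" && \
   tritonserver --model-repository=\"$MODEL_REPO\""]
```

Because the ModelKit reference is an environment variable, the same image can serve any ModelKit, which is the "generic container with enough meta information" variant above.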

Describe alternatives you've considered

  1. Baking Artifacts into the Container:

    • Baking the artifacts directly into the container was considered, but this approach lacks flexibility and leads to larger images.
  2. External Model Storage:

    • Hosting the models in external storage and mounting them at runtime. This adds complexity and potential latency.
  3. On-Demand Model Fetching:

    • Fetching models on demand during inference requests. This could introduce latency on the first request.
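The init-container variant mentioned above might look like the following Kubernetes Pod sketch. The image names, the ModelKit reference, and the kit CLI flags are illustrative assumptions; the point is only the shape: an init container unpacks the ModelKit into a shared volume that Triton then serves from.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: triton-modelkit-poc
spec:
  volumes:
    - name: model-repo
      emptyDir: {}
  initContainers:
    - name: fetch-modelkit
      image: ghcr.io/jozu-ai/kit:latest   # assumed image containing the kit CLI
      command: ["kit", "unpack", "jozu.ml/myorg/mymodel:latest",
                "--model", "-d", "/models"]
      volumeMounts:
        - name: model-repo
          mountPath: /models
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.05-py3
      args: ["tritonserver", "--model-repository=/models"]
      volumeMounts:
        - name: model-repo
          mountPath: /models
```

Baking the model into the init container itself would replace the `kit unpack` command with a plain copy out of the init image, trading download time at startup for a larger (but cacheable) init image.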

Additional context

  • The goal is to achieve efficient and flexible model management within the inference container.
  • Consider potential performance implications of different model loading strategies, especially with respect to Triton's capabilities.
  • Ensure compatibility with existing KitOps and ModelKit infrastructure and suggest improvements.

gorkem • Jun 09 '24 23:06