
LLM Backend Implementation

farook-edev opened this issue 5 months ago • 0 comments

This issue relates to the LLM Pipeline and related backend/driver changes.

Current Implementation Status*:

  • Pipeline initializes the model and interpreter, allocates tensors, and sets up KV caches.
  • Pipeline accepts input as std::vector<int>, int (a list of token IDs plus an end_token_id).
  • Pipeline runs inference and produces correct output.
  • Output is returned as std::vector<int> (a list of token IDs).
  • Pipeline can be configured with a different delegate, a fixed number of output tokens, and a set number of CPU threads (currently these values are hard-coded).
  • Pipeline can invoke a first_token_callback provided by the driver to report time to first token (TTFT) to LoadGen.

Todo:

  • [ ] Combine the first-token inference function into the normal inference function.**
  • [x] Code Formatting and linting.
  • [x] Changing issue_query() signature to comply with changes made to backend interface.
  • [x] Resolve any remaining code quality and CI issues.
  • [ ] Provide logits to a potential cross-backend decoder instead of building a decoder inside the pipeline (discussion needed).

\* This relates to the default implementation using TFLite (LiteRT) on a CPU delegate.
\*\* This only affects code structure; inference remains functional.

Any other discussions or requirements relating to the pipeline should go here.

farook-edev · Oct 01 '25 23:10