
LLM Backend Implementation

farook-edev opened this issue 5 months ago • 0 comments

This issue relates to the LLM Pipeline and related backend/driver changes.

Current Implementation Status*:

  • Pipeline initializes the model and interpreter, allocates tensors, and sets up KV caches.
  • Pipeline accepts input as std::vector<int>, int (a list of token IDs plus an end_token_id).
  • Pipeline runs inference and produces correct output.
  • Output is returned as std::vector<int> (a list of token IDs).
  • Pipeline can be configured with a different delegate, a fixed number of output tokens, and a set number of CPU threads (currently these values are hard-coded).
  • Pipeline can invoke a first_token_callback provided by the driver to report time to first token (TTFT) to LoadGen.

Todo:

  • [ ] Combine the first-token inference function into the normal inference function.**
  • [x] Code Formatting and linting.
  • [x] Changing issue_query() signature to comply with changes made to backend interface.
  • [x] Resolve any remaining code quality and CI issues.
  • [ ] Provide logits to a potential cross-backend decoder instead of building a decoder inside the pipeline (discussion needed).

\* This relates to the default implementation using TFLite (LiteRT) on a CPU delegate.
\*\* This only affects code structure; inference remains functional.

Any other discussions or requirements relating to the pipeline should go here.

farook-edev · Oct 01 '25 23:10