
QCT Genie SDK (genie-t2t-run): Llama v2 7B performance

Open · taeyeonlee opened this issue 6 months ago · 6 comments

The Llama v2 7B quantized model bin file (llama_qct_genie.bin) runs on a Galaxy S23 Ultra using the QCT Genie SDK (genie-t2t-run), but token generation is very slow; the result is below. The bin file was generated following the tutorial (file:///opt/qcom/aistack/qairt/2.25.0.240728/docs/Genie/general/tutorials.html):

```
cd ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/
./qnn-genai-transformer-composer --quantize Z4 --outfile /home/taeyeon/QAI_Genie/llama_qct_genie.bin --model /home/taeyeon/QAI_Genie/Llama-2-7b-hf --export_tokenizer_json
```
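
As a quick sanity check that the exported tokenizer is usable, something like the following should load it and round-trip a prompt (a minimal sketch; the tokenizer.json path and the use of the Hugging Face tokenizers package are my assumptions, not part of the Genie tutorial):

```python
from tokenizers import Tokenizer

# Assumed output location of the JSON written by --export_tokenizer_json.
tok = Tokenizer.from_file("/home/taeyeon/QAI_Genie/tokenizer.json")

print(tok.get_vocab_size())  # expect 32000 for Llama 2
enc = tok.encode("Tell me about Qualcomm")
print(enc.tokens)            # SentencePiece-style pieces; spaces appear as the metaspace '▁'
print(tok.decode(enc.ids))   # should reproduce the prompt, including spaces
```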

```
dm3q:/data/local/tmp $ ./genie-t2t-run -c /data/local/tmp/llama2-7b-genaitransformer.json -p "Tell me about Qualcomm"
Using libGenie.so version 1.0.0

[PROMPT]: Tell me about Qualcomm

hopefullythisistherightplacetopostthis.

100Mbpsover4Gisaverydifferentexperiencethan100MbpsoverWiFi. Iamnotsurewhythatshouldbedifferentfor5G. YoucanalwaystunedowntheWiFiifyouwantmorebatterylife. Idonotthinkyoucandothatwiththe5G.

the5Gisnotreallythatimportanttome. 4Ghasnotbeenanissue.

I'mlookingfora5GmodemtotestwiththeRaspberryPi. Qualcomm5GX50modem.[END]

Prompt processing: 2281999 us
Token generation: 353360439 us, 0.464115 tokens/s
```
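
As a rough check on the numbers above, back-computing the generated-token count from the reported rate and time (a minimal sketch; the token count is inferred, not printed by genie-t2t-run):

```python
# Values copied from the genie-t2t-run log above.
prompt_processing_us = 2_281_999
token_generation_us = 353_360_439
reported_tokens_per_s = 0.464115

# Inferred number of generated tokens (rate * time): roughly 164 tokens in ~353 s.
generated_tokens = reported_tokens_per_s * token_generation_us / 1e6
print(f"~{generated_tokens:.0f} tokens in {token_generation_us / 1e6:.0f} s "
      f"at {reported_tokens_per_s:.3f} tokens/s")
```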

I have some questions.

  1. Why is there no space in the text generated by the Llama v2 7B quantized model with the QCT Genie SDK (genie-t2t-run)? (See the tokenizer sketch after this list.)
  2. Why is token generation so slow, even though the site (https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized) lists 11.3 tokens/s for Llama v2 7B? That is roughly 24x faster than the 0.464 tokens/s measured above.
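
For question 1, one possible explanation (my assumption, not confirmed): Llama 2's SentencePiece tokenizer marks word-initial spaces with the metaspace character '▁' (U+2581), so if detokenization strips '▁' instead of converting it to a space, the output looks exactly like the space-less text above. A minimal sketch, assuming the tokenizer.model shipped with the local Llama-2-7b-hf checkpoint and the sentencepiece Python package:

```python
import sentencepiece as spm

# tokenizer.model from the local Llama-2-7b-hf checkpoint (path is an assumption).
sp = spm.SentencePieceProcessor(model_file="/home/taeyeon/QAI_Genie/Llama-2-7b-hf/tokenizer.model")

pieces = sp.encode("Tell me about Qualcomm", out_type=str)
print(pieces)                            # e.g. ['▁Tell', '▁me', '▁about', ...]
print("".join(pieces).replace("▁", ""))  # naive join: 'TellmeaboutQualcomm' (spaces lost)
print(sp.decode(pieces))                 # proper decode: 'Tell me about Qualcomm'
```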

taeyeonlee · Aug 02 '24 20:08