ai-hub-models
QCT Genie SDK (genie-t2t-run) : Llama v2 7B performance
The Llama v2 7B quantized model bin file (llama_qct_genie.bin) runs on a Galaxy S23 Ultra using the QCT Genie SDK (genie-t2t-run), but inference is very slow. The result is shown below.

The bin file was generated by following the tutorial (file:///opt/qcom/aistack/qairt/2.25.0.240728/docs/Genie/general/tutorials.html):

cd ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/
./qnn-genai-transformer-composer --quantize Z4 --outfile /home/taeyeon/QAI_Genie/llama_qct_genie.bin --model /home/taeyeon/QAI_Genie/Llama-2-7b-hf --export_tokenizer_json
=======================================
dm3q:/data/local/tmp $ ./genie-t2t-run -c /data/local/tmp/llama2-7b-genaitransformer.json -p "Tell me about Qualcomm"
Using libGenie.so version 1.0.0
[PROMPT]: Tell me about Qualcomm
hopefullythisistherightplacetopostthis.
100Mbpsover4Gisaverydifferentexperiencethan100MbpsoverWiFi. Iamnotsurewhythatshouldbedifferentfor5G. YoucanalwaystunedowntheWiFiifyouwantmorebatterylife. Idonotthinkyoucandothatwiththe5G.
the5Gisnotreallythatimportanttome. 4Ghasnotbeenanissue.
I'mlookingfora5GmodemtotestwiththeRaspberryPi. Qualcomm5GX50modem.[END]
Prompt processing: 2281999 us
Token generation: 353360439 us, 0.464115 tokens/s
=======================================
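For context, a back-of-the-envelope check of the timings logged above (these numbers come directly from the genie-t2t-run output):

```python
# Stats logged by genie-t2t-run above.
prompt_processing_us = 2_281_999
token_generation_us = 353_360_439
tokens_per_s = 0.464115

token_generation_s = token_generation_us / 1e6        # ~353.4 s total generation time
tokens_generated = tokens_per_s * token_generation_s  # ~164 tokens produced
s_per_token = 1 / tokens_per_s                        # ~2.15 s per token

print(f"~{tokens_generated:.0f} tokens in {token_generation_s:.1f} s "
      f"({s_per_token:.2f} s/token)")
```

So the run spent almost six minutes generating roughly 164 tokens, i.e. over two seconds per token.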
I have some questions.
- Why is there no space in the text generated by the Llama v2 7B quantized model using the QCT Genie SDK (genie-t2t-run)?
- Why is the performance (token generation speed) so slow, even though the site (https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized) reports 11.3 tokens/s for Llama v2 7B?
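Regarding the missing spaces, one possible cause (an assumption, not confirmed for Genie): Llama 2 uses a SentencePiece tokenizer whose vocabulary marks word boundaries with the "▁" character (U+2581) rather than literal spaces, so a detokenizer that strips that marker instead of mapping it to a space would produce exactly this run-together text. A minimal sketch with hypothetical token strings:

```python
# Hypothetical SentencePiece-style tokens for "Tell me about Qualcomm".
# "▁" (U+2581) marks a word boundary in Llama 2's tokenizer vocabulary;
# the exact token split here is illustrative, not taken from the real vocab.
tokens = ["\u2581Tell", "\u2581me", "\u2581about", "\u2581Qual", "comm"]

# Broken detokenization: dropping the marker loses every space.
broken = "".join(t.replace("\u2581", "") for t in tokens)
print(broken)  # "TellmeaboutQualcomm"

# Correct detokenization: map the marker to a space, trim the leading one.
fixed = "".join(t.replace("\u2581", " ") for t in tokens).lstrip()
print(fixed)   # "Tell me about Qualcomm"
```

If this is the cause, checking the tokenizer JSON exported by --export_tokenizer_json against what genie-t2t-run expects would be a reasonable first step.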