streaming-llm
Run with start_size=0 looks just fine
I've run a number of experiments, and it looks like most of the performance gain comes from enabling pos_shift.
python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8
PPL: 6.840701103210449

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8 --enable_start_recent_kv_cache --start_size 1 --recent_size 255
PPL: 29.674755096435547

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8 --enable_start_recent_kv_cache --start_size 0 --recent_size 256 --enable_pos_shift
PPL: 8.8959321975708

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8 --enable_start_recent_kv_cache --start_size 1 --recent_size 255 --enable_pos_shift
PPL: 7.493190765380859

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8 --enable_start_recent_kv_cache --start_size 4 --recent_size 252 --enable_pos_shift
PPL: 7.363883018493652
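For reference, this is a minimal sketch of what I understand `--enable_start_recent_kv_cache --start_size S --recent_size R` to do: keep the first S and the most recent R entries of each layer's KV cache and drop everything in between. The function name and the assumed tensor layout ([batch, heads, seq_len, head_dim]) are my own illustration, not the repo's actual implementation.

```python
import torch

def start_recent_evict(past_key_values, start_size=4, recent_size=252):
    """Keep only the first `start_size` and the last `recent_size` cache
    entries per layer; assumes keys/values are shaped
    [batch, heads, seq_len, head_dim] (seq dim = 2)."""
    if past_key_values is None:
        return None
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # nothing to evict yet
    return [
        (
            torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2),
        )
        for k, v in past_key_values
    ]
```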
The generated output of the following script also looks fine to me:

python examples/run_streaming_llama.py --enable_streaming --recent_size 128 --start_size 0
Am I doing something wrong? (Could the choice of model or dataset matter?) Is it fair to conclude that the major factor harming generation performance is incorrectly applied positional encoding?
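To make the pos_shift question concrete, my understanding (sketched below; the helper name is hypothetical) is that with `--enable_pos_shift` the position ids used for RoPE follow a token's index inside the truncated cache rather than its absolute index in the stream, so the rotary angles stay within the range seen during training even after old entries are evicted.

```python
import torch

def shifted_position_ids(cache_len: int, n_new: int) -> torch.Tensor:
    # Cache-relative positions: the new token is placed right after the
    # entries currently held in the (truncated) KV cache.
    return torch.arange(cache_len, cache_len + n_new)

# After streaming 10_000 tokens with a 256-entry cache, absolute indexing
# would give position 10_000; cache-relative indexing gives 256.
print(shifted_position_ids(cache_len=256, n_new=1))  # tensor([256])
```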