
Introduce seq_len as inference param, and improve warnings

Open · abhinaykukkadapu opened this pull request 1 month ago · 6 comments

Summary of changes:

  1. Add a --seq_len param to the llama script, distinct from --max_seq_len, which is a compile-time param.
  2. Add a warning in the runner when seq_len is clamped to max_seq_len, instead of clamping silently.
  3. Add a warning in the token generator when EOS is not reached because seq_len or max_seq_len is too small.

Differential Revision: D86696759

Tests

Use --seq_len=600, prompt_len=512

I 00:00:02.883890 executorch:token_generator.cpp:333] Warning: Generation stopped at seq_len limit (600) without reaching EOS token. Response may be incomplete.
I 00:00:02.884094 executorch:token_generator.cpp:346] - seq_len (600) is less than compiled max_seq_len (1024). Consider increasing --seq_len (up to 1024).
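
A minimal C++ sketch of the kind of check that could produce the two messages above (not the actual token_generator.cpp; the names pos, seq_len, max_seq_len, eos_reached and the plain printf logging are assumptions made for illustration):

```cpp
// Minimal sketch of the check behind the two warnings above.
// Not the actual ExecuTorch token_generator.cpp: the names and the
// printf logging are illustrative only.
#include <cstdint>
#include <cstdio>

void warn_if_stopped_without_eos(
    int64_t pos,          // position reached when generation stopped
    int64_t seq_len,      // sequence length requested for this run
    int64_t max_seq_len,  // sequence length fixed at compile time
    bool eos_reached) {   // whether the EOS token was produced
  if (eos_reached || pos < seq_len) {
    return;  // generation ended normally; nothing to warn about
  }
  std::printf(
      "Warning: Generation stopped at seq_len limit (%lld) without reaching "
      "EOS token. Response may be incomplete.\n",
      static_cast<long long>(seq_len));
  if (seq_len < max_seq_len) {
    // The user can still ask for more tokens without recompiling the model.
    std::printf(
        "- seq_len (%lld) is less than compiled max_seq_len (%lld). "
        "Consider increasing --seq_len (up to %lld).\n",
        static_cast<long long>(seq_len),
        static_cast<long long>(max_seq_len),
        static_cast<long long>(max_seq_len));
  }
}
```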

Use --seq_len=2048, prefill_ar_len=1024

I 00:00:00.546967 executorch:runner.cpp:385] Warning: Requested seq_len (2048) exceeds compiled max_seq_len (1024). Clamping to 1024.
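
And a similar sketch of the clamping behavior described in this warning, again with illustrative names and plain printf standing in for the runner's real logging (not the actual runner.cpp):

```cpp
// Minimal sketch of clamping the requested seq_len to the compiled
// max_seq_len instead of truncating silently. Illustrative only.
#include <cstdint>
#include <cstdio>

int64_t clamp_seq_len(int64_t requested_seq_len, int64_t max_seq_len) {
  if (requested_seq_len > max_seq_len) {
    // Warn instead of clamping silently, so the user knows why the output
    // stops earlier than requested.
    std::printf(
        "Warning: Requested seq_len (%lld) exceeds compiled max_seq_len "
        "(%lld). Clamping to %lld.\n",
        static_cast<long long>(requested_seq_len),
        static_cast<long long>(max_seq_len),
        static_cast<long long>(max_seq_len));
    return max_seq_len;
  }
  return requested_seq_len;
}
```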

abhinaykukkadapu · Nov 10 '25 20:11

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15716

Note: Links to docs will display an error until the docs builds have been completed.

:x: 1 New Failure

As of commit 76b9c2863a90551bdc3283a83b128969346be5d1 with merge base e774b7796bd465d30dff39799110e63f9a8e0a99:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] · Nov 10 '25 20:11

@abhinaykukkadapu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86696759.

meta-codesync[bot] · Nov 10 '25 20:11

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example @pytorchbot label "release notes: none"

For more information, see https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

github-actions[bot] · Nov 10 '25 20:11

Hi @abhinaykukkadapu, thanks for the PR. I would like to know if we could achieve the same thing with the combination of --pre_gen_pte & --max_seq_len.

For example: During compile time, you can provide: --max_seq_len 1024 --compile_only

During inference, you can provide: --max_seq_len 512 --pre_gen_pte ./path_to_pregen_pte

@winskuo-quic Thanks for the quick review. The goal of this additional param is to avoid confusing users of the script into thinking that --max_seq_len can be dynamic; it is a static param that is fixed at compilation.

Currently, one can pass --max_seq_len for inference, which made me think the total context length could be changed dynamically; I only found out after digging through the code that it is piped as seq_len to the runner. With this new param there is a clear distinction: --max_seq_len is used at compile time and --seq_len at runtime/inference. Functionally this change is a no-op; it only improves the user experience and makes it visible to the user when we clamp the value.

abhinaykukkadapu · Nov 11 '25 03:11

I see. I think it makes sense to add some warning messages in .cpp files to guide users, which is helpful. However, for llama.py, I would like to know if you think flag --max_seq_len is misleading? The --max_seq_len flag can actually be set to a different number every time during execution, as long as it is shorter than the --max_seq_len used during compilation.

winskuo-quic · Nov 11 '25 08:11

> I think it makes sense to add some warning messages in .cpp files to guide users, which is helpful.

@winskuo-quic Right, I think we should be transparent about these. I've already added the messages that I think would be helpful, but please suggest any others you have in mind.

> However, for llama.py, I would like to know if you think flag --max_seq_len is misleading? The --max_seq_len flag can actually be set to a different number every time during execution, as long as it is shorter than the --max_seq_len used during compilation.

Yeah, I think this param is misleading: it clearly represents the max context a model can have, so using it during inference can confuse someone new to the Qcom delegate, who might not know we use static llama and might think it can change the model's total context length dynamically. Also, all I did was take the qnn runner's existing param (--seq_len) and bubble it up to the llama.py script.

If you think this adds to the confusion, I'm also open to removing it and keeping only the warning messages in this PR.

abhinaykukkadapu · Nov 11 '25 17:11

Let's rename the PR title, because it's not introducing seq_len anymore.

cccclai · Nov 17 '25 23:11