[Playground] Long generation cannot be completed
Describe the bug: In the playground, when the generation is long, it cannot finish in a single response. In that case I typed "continue" to let it keep going, but it didn't work: it either repeated from the beginning or returned an empty response.
Information about your version: tabby 0.7.0
Information about your GPU
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:02:00.0 Off | N/A |
| 30% 26C P8 21W / 350W | 22253MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:03:00.0 Off | N/A |
| 30% 26C P8 28W / 350W | 13576MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
For now, there is a hard limitation of 2048 input tokens and a maximum of 1920 output tokens. We might consider increasing these numbers in the future.
If the output exceeds 1920 tokens (or whatever larger maximum applies in the future), is there any way to let it continue so that everything gets output?
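For example, would a client-side loop along these lines make sense? (Rough sketch only: `Token` and the `generate` closure are placeholders rather than Tabby's actual API, and the 2048/1920 budgets are just the limits mentioned above.)

```rust
type Token = u32;

const MAX_INPUT_TOKENS: usize = 2048;
const MAX_OUTPUT_TOKENS: usize = 1920;

/// Keep requesting completions, feeding the tail of what has been generated so far
/// back in as context, until the model stops on its own (i.e. returns fewer tokens
/// than the output cap).
fn continue_until_done<F>(prompt: &[Token], mut generate: F) -> Vec<Token>
where
    // `generate` stands in for one completion request; it returns at most
    // MAX_OUTPUT_TOKENS tokens per call.
    F: FnMut(&[Token]) -> Vec<Token>,
{
    let mut output: Vec<Token> = Vec::new();
    // Bounded iteration count so a model that always fills the cap cannot loop forever.
    for _ in 0..32 {
        // Next request: original prompt plus everything generated so far,
        // truncated from the left so it fits the input window.
        let mut context: Vec<Token> = prompt.to_vec();
        context.extend_from_slice(&output);
        let start = context.len().saturating_sub(MAX_INPUT_TOKENS);
        let chunk = generate(&context[start..]);
        let finished = chunk.len() < MAX_OUTPUT_TOKENS;
        output.extend(chunk);
        if finished {
            break;
        }
    }
    output
}

fn main() {
    // Dummy generator for illustration: returns a short fixed chunk, so the loop
    // finishes after one call.
    let answer = continue_until_done(&[1, 2, 3], |_ctx| vec![42, 43, 44]);
    println!("{answer:?}");
}
```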
Is there a reason why the context length is static for Tabby? @wsxiaoys
Hey - could you elaborate a bit? A static context length is kind of an intrinsic thing for transformer-based LLMs.
Sorry. Depending on the model, the context can be increased to a certain size. You state that there is a hard limit on input and output tokens; is that hard-coded in Tabby, or does it vary with the model being used?
Ah - I get your point. It does make sense to read this value from either the registry or from the gguf files directly. Filing https://github.com/TabbyML/tabby/issues/1402 to track.
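For reference, a minimal sketch of what reading that value straight out of a GGUF file could look like, hand-rolling the metadata parsing (assuming the GGUF v2/v3 layout; `model.gguf` is a placeholder path, and this is not Tabby's actual implementation):

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};

fn read_u32(r: &mut impl Read) -> io::Result<u32> {
    let mut b = [0u8; 4]; r.read_exact(&mut b)?; Ok(u32::from_le_bytes(b))
}
fn read_u64(r: &mut impl Read) -> io::Result<u64> {
    let mut b = [0u8; 8]; r.read_exact(&mut b)?; Ok(u64::from_le_bytes(b))
}
fn read_string(r: &mut impl Read) -> io::Result<String> {
    let len = read_u64(r)? as usize; // GGUF v2+ strings are u64-length-prefixed
    let mut buf = vec![0u8; len];
    r.read_exact(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf).into_owned())
}

// Consume one metadata value of the given GGUF type; return integer-like values as u64.
fn read_value(r: &mut impl Read, ty: u32) -> io::Result<Option<u64>> {
    Ok(match ty {
        0 | 1 | 7 => { // u8 / i8 / bool
            let mut b = [0u8; 1]; r.read_exact(&mut b)?; Some(b[0] as u64)
        }
        2 | 3 => { // u16 / i16
            let mut b = [0u8; 2]; r.read_exact(&mut b)?; Some(u16::from_le_bytes(b) as u64)
        }
        4 | 5 | 6 => { // u32 / i32 / f32 (4 bytes each)
            let v = read_u32(r)?; if ty == 6 { None } else { Some(v as u64) }
        }
        10 | 11 | 12 => { // u64 / i64 / f64 (8 bytes each)
            let v = read_u64(r)?; if ty == 12 { None } else { Some(v) }
        }
        8 => { read_string(r)?; None } // string: skip
        9 => { // array: element type, length, then elements
            let (elem_ty, len) = (read_u32(r)?, read_u64(r)?);
            for _ in 0..len { read_value(r, elem_ty)?; }
            None
        }
        _ => return Err(io::Error::new(io::ErrorKind::InvalidData, "unknown GGUF value type")),
    })
}

// Scan the metadata section for `<arch>.context_length` (e.g. "llama.context_length").
fn gguf_context_length(path: &str) -> io::Result<Option<u64>> {
    let mut r = BufReader::new(File::open(path)?);
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != b"GGUF" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a GGUF file"));
    }
    let (_version, _tensors, kv_count) = (read_u32(&mut r)?, read_u64(&mut r)?, read_u64(&mut r)?);
    let mut ctx_len = None;
    for _ in 0..kv_count {
        let key = read_string(&mut r)?;
        let ty = read_u32(&mut r)?;
        let value = read_value(&mut r, ty)?;
        if key.ends_with(".context_length") { ctx_len = value; }
    }
    Ok(ctx_len)
}

fn main() -> io::Result<()> {
    // "model.gguf" is a placeholder path.
    match gguf_context_length("model.gguf")? {
        Some(n) => println!("context_length = {n}"),
        None => println!("context_length not found"),
    }
    Ok(())
}
```

llama.cpp also exposes the trained context length (via `llama_n_ctx_train`) once a model is loaded, which may be a simpler route than parsing the file directly.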