Any plan to increase the model's context window and output token limit?
GPT 3.5 has a 4096-token context window. Do you plan to increase the model's context window and output token limit? I am not an expert in this field, but this seems like a promising approach: Parallel Context Windows Improve In-Context Learning of Large Language Models
For applications that require processing large amounts of text at inference time, Large Language Models (LLMs) are handicapped by their limited context windows, which are typically 2048 tokens. In-context learning, an emergent phenomenon in LLMs in sizes above a certain parameter threshold, constitutes one significant example because it can only leverage training examples that fit into the context window. Existing efforts to address the context window limitation involve training specialized architectures, which tend to be smaller than the sizes in which in-context learning manifests due to the memory footprint of processing long texts. We present Parallel Context Windows (PCW), a method that alleviates the context window restriction for any off-the-shelf LLM without further training. The key to the approach is to carve a long context into chunks ("windows") that fit within the architecture, restrict the attention mechanism to apply only within each window, and re-use the positional embeddings among the windows. We test the PCW approach on in-context learning with models that range in size between 750 million and 178 billion parameters, and show substantial improvements for tasks with diverse input and output spaces. Our results motivate further investigation of Parallel Context Windows as a method for applying off-the-shelf LLMs in other settings that require long text sequences. https://arxiv.org/abs/2212.10947
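For intuition, here is a rough PyTorch sketch of the idea as I read the abstract (this is not the paper's code, and the window length W, window count, and task length T are made-up toy values): each context window only attends within itself, the task tokens at the end keep causal access to all windows, and the position ids restart in every window.

```python
# Illustrative sketch only, not the paper's implementation.
import torch

W, T, n_windows = 4, 3, 3            # toy sizes: window length, task length, number of windows
total = n_windows * W + T

# Start from an ordinary causal mask: token i may attend to tokens <= i.
mask = torch.ones(total, total).tril().bool()

# Restrict attention so each window only sees itself (zero the cross-window blocks).
for i in range(n_windows):
    for j in range(n_windows):
        if i != j:
            mask[i * W:(i + 1) * W, j * W:(j + 1) * W] = False

# The task/generation tokens (last T positions) still attend causally to every
# window, because only window-to-window blocks were zeroed above.

# Re-use the same position ids in every window, then continue counting for the
# task tokens, so no token exceeds the position range the model was trained on.
position_ids = torch.cat([torch.arange(W) for _ in range(n_windows)]
                         + [torch.arange(W, W + T)])

print(mask.int())
print(position_ids)
```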
To expand the context window, you would need to retrain the entire model.
The paper above explicitly states that its design works with off-the-shelf models. It breaks the larger context into smaller chunks and reuses/updates the positional embeddings of the first section to build a more complete but memory-efficient representation of the data. It's similar to how the Longformer architecture handles a larger context window by repeating the first set of positional embeddings sequentially.
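A minimal sketch of that chunking step, using a hypothetical helper (this is not the PCW repo's API): split a long token sequence into windows that each fit the trained context length, and give every window the same position ids so the positional embeddings are re-used across windows.

```python
# Hypothetical helper for illustration only.
from typing import List, Tuple
import torch

def make_parallel_windows(tokens: torch.Tensor, window_len: int
                          ) -> List[Tuple[torch.Tensor, torch.Tensor]]:
    """Return (window_tokens, position_ids) pairs for each chunk."""
    windows = []
    for start in range(0, tokens.numel(), window_len):
        chunk = tokens[start:start + window_len]
        # Every chunk restarts its positions at 0, mirroring the
        # "repeat the first set of positional embeddings" idea.
        pos = torch.arange(chunk.numel())
        windows.append((chunk, pos))
    return windows

# Example: a 10-token "document" with a trained context length of 4.
doc = torch.arange(100, 110)
for chunk, pos in make_parallel_windows(doc, window_len=4):
    print(chunk.tolist(), pos.tolist())
```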
We just implemented PCW for LLaMA in the PCW repo: https://github.com/AI21Labs/Parallel-Context-Windows
Feel free to try it out.
Without retraining the model?
It works with the model without any further training. Check out the paper for a further explanation: https://arxiv.org/abs/2212.10947. Note that it usually works best with bigger models, so if your use case doesn't work well with the 7B LLaMA, try the 13B or 33B.
Closing, as we launched Llama 2 with a bigger context window.