Long-Context-Data-Engineering

Implementation of the paper Data Engineering for Scaling Language Models to 128K Context

Issues (11)

Hi, when I utilize the tensor-parallel package as the repo indicates:

```
model = transformers.LlamaForCausalLM.from_pretrained(
    model_path,
    use_flash_attention_2="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
# This is the continued-pretrained LLaMA 2 7B model...
```

Excellent work! We tried to reproduce this recipe on a Chinese-domain model, but in practice, following the paper's hyperparameters and continuing training on 8×80GB A100 GPUs, we could only reach a 32K training context, whereas the paper reports 80K. Could you share the tricks behind this?

I think there are some statistical biases in this implementation of long-context data engineering. Concern 1: in `upsample` mode, some dataset groups get `filtered` when their capacity is maxed out....
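To make the concern concrete, here is a minimal, hypothetical sketch of per-domain upsampling with a capacity cap; all function and variable names are illustrative, not the repo's actual code. The point is that when a domain's requested count exceeds its capacity, the excess is silently dropped, which skews the resulting mixture away from the intended ratios:

```python
import random

def upsample(domain_docs, target_counts, capacities):
    """Illustrative upsampling: sample each domain up to its capacity.

    Any request beyond `capacities[domain]` is silently filtered out,
    which is the statistical bias the issue describes.
    """
    sampled = {}
    for domain, docs in domain_docs.items():
        want = target_counts[domain]
        cap = capacities[domain]
        take = min(want, cap)  # excess beyond capacity is dropped
        # sample with replacement so a domain can exceed its raw doc count
        sampled[domain] = [random.choice(docs) for _ in range(take)]
    return sampled
```

Under this sketch, a domain asked for 5 samples with a capacity of 3 contributes only 3, so the realized mixture no longer matches the target proportions.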

In the data tokenization process, there is no defense against special tokens (such as `<s>`, `</s>`, etc.) appearing in the data. We found multiple...
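One possible defense, sketched below under the assumption of LLaMA-style special tokens, is to strip literal special-token strings from raw text before tokenization so that data containing `<s>` or `</s>` cannot inject control tokens; the token list and function name are illustrative:

```python
import re

# Illustrative LLaMA-style special tokens; adapt to the tokenizer in use.
SPECIAL_TOKENS = ["<s>", "</s>", "<unk>"]

def sanitize(text: str) -> str:
    """Remove literal special-token strings from raw training text."""
    pattern = "|".join(re.escape(t) for t in SPECIAL_TOKENS)
    return re.sub(pattern, "", text)
```

An alternative design is to escape rather than delete the offending strings, which preserves the surface text at the cost of slightly altering it.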

Hi, I recently read this excellent paper; thank you for your work. I'd like to ask about some details. Specifically, a model straight out of the pretraining stage should have fairly weak instruction-following ability: in my own tests, especially with long contexts, it keeps continuing the text and repeating itself, and rarely answers the question completely. I'm curious what evaluation method the paper uses that lets a pretrained model handle long contexts this way. Best wishes!

This is odd; I wonder whether something went wrong somewhere?

```
expected_answer = "eat a sandwich and sit in Dolores Park on a sunny day.".lower().split()
model_response = "eat a sandwich and sit in Dolores Park on a sunny day.".lower()
score...
```
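The truncated snippet suggests a word-overlap score: the fraction of expected-answer words found in the model response. A minimal reconstruction of that scoring style is sketched below; the exact implementation in the repo may differ, and the function name is mine:

```python
def overlap_score(expected_answer: str, model_response: str) -> float:
    """Fraction of expected-answer words that appear in the response."""
    expected_words = expected_answer.lower().split()
    response = model_response.lower()
    hits = sum(1 for word in expected_words if word in response)
    return hits / len(expected_words)
```

Note that with this scheme a response identical to the expected answer scores 1.0, so a surprising score on the snippet's identical strings would indeed point to a bug elsewhere.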

Congratulations on the great work. I noticed that with the current HF code, the model collapses on very short input lengths. For example, I tried a text completion task...

Hi, I noticed you used dynamic NTK in llama-7b-80k. I'm curious when you applied it: before or after the training phase? Thank you for your reply.
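For readers unfamiliar with the term, dynamic NTK scaling enlarges the RoPE base on the fly once the input grows past the trained context length, rather than fixing it at training time; this matches the "dynamic" RoPE scaling scheme popularized in HF transformers. The sketch below is illustrative, and the scaling `factor` and function name are assumptions:

```python
def dynamic_ntk_base(base: float, dim: int, seq_len: int,
                     max_trained_len: int, factor: float = 2.0) -> float:
    """Scale the RoPE base when seq_len exceeds the trained context.

    Within the trained context the base is untouched; beyond it, the
    base grows with sequence length so positional rotations stretch.
    """
    if seq_len <= max_trained_len:
        return base  # no scaling needed inside the trained window
    scale = (factor * seq_len / max_trained_len) - (factor - 1)
    return base * scale ** (dim / (dim - 2))
```

Because the adjustment depends on the current input length, it is naturally an inference-time mechanism, which is presumably why the question of "before or after training" arises.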

![image](https://github.com/FranxYao/Long-Context-Data-Engineering/assets/34389681/afd5259e-f358-4bfd-8691-a08530961db6) This is the result we get with the code in this repo. We followed the README step by step, making sure the environment, model, and requirements are the same...

Hi! In your paper you mention:

```
We do not make any significant change to model architecture other than adjusting the base of RoPE, as in Xiong et al....
```
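As background on what "adjusting the base of RoPE" means: each rotary dimension rotates at frequency theta_i = base^(-2i/d), and long-context continued training typically raises `base` so that rotations at distant positions remain distinguishable. The sketch below only illustrates that relationship; the specific base values are placeholders, not the paper's:

```python
def rope_frequencies(dim: int, base: float):
    """Per-dimension RoPE rotation frequencies theta_i = base^(-2i/dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# Illustrative comparison: a larger base shrinks every frequency past
# the first, i.e. rotations advance more slowly per position.
short_ctx = rope_frequencies(128, 10_000.0)
long_ctx = rope_frequencies(128, 5_000_000.0)
assert all(l < s for l, s in zip(long_ctx[1:], short_ctx[1:]))
```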