Long-Context-Data-Engineering

Implementation of the paper Data Engineering for Scaling Language Models to 128K Context

Issues (11)

Hi, when I utilize the tensor-parallel package as the repo indicates:

```
model = transformers.LlamaForCausalLM.from_pretrained(
    model_path,
    use_flash_attention_2="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
# This is the continued-pretrained LLaMA 2 7B model...
```

Excellent work! We tried to reproduce this recipe on a Chinese-domain model, but in practice, following the paper's hyperparameters and continuing training on 8×80GB A100 GPUs, we could only reach a 32K training context, whereas the paper reports 80K. Could you share the tricks behind this?

I think there are some statistical biases in this implementation of long-context data engineering. Concern 1: in `upsample` mode, some dataset groups get `filtered` when their capacity is maxed out....
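To make the concern concrete, here is a minimal, hypothetical sketch of per-domain upsampling with a capacity cap; all function and variable names are illustrative, not the repo's actual code. The point is that when a domain's requested count exceeds its capacity, the excess is silently dropped, which skews the resulting mixture away from the intended ratios:

```python
import random

def upsample(domain_docs, target_counts, capacities):
    """Illustrative upsampling: sample each domain up to its capacity.

    Any request beyond `capacities[domain]` is silently filtered out,
    which is the statistical bias the issue describes.
    """
    sampled = {}
    for domain, docs in domain_docs.items():
        want = target_counts[domain]
        cap = capacities[domain]
        take = min(want, cap)  # excess beyond capacity is dropped
        # sample with replacement so a domain can exceed its raw doc count
        sampled[domain] = [random.choice(docs) for _ in range(take)]
    return sampled
```

Under this sketch, a domain asked for 5 samples with a capacity of 3 contributes only 3, so the realized mixture no longer matches the target proportions.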

In the data tokenization process, there is no defense against special tokens (such as `<s>`, `</s>`, etc.) appearing in the data. We found multiple...
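One possible defense, sketched below under the assumption of LLaMA-style special tokens, is to strip literal special-token strings from raw text before tokenization so that data containing `<s>` or `</s>` cannot inject control tokens; the token list and function name are illustrative:

```python
import re

# Illustrative LLaMA-style special tokens; adapt to the tokenizer in use.
SPECIAL_TOKENS = ["<s>", "</s>", "<unk>"]

def sanitize(text: str) -> str:
    """Remove literal special-token strings from raw training text."""
    pattern = "|".join(re.escape(t) for t in SPECIAL_TOKENS)
    return re.sub(pattern, "", text)
```

An alternative design is to escape rather than delete the offending strings, which preserves the surface text at the cost of slightly altering it.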

Hi, I recently read this excellent paper; thank you for your work. I'd like to ask about some details. Specifically, a model straight out of the pretraining stage should have fairly weak instruction-following ability: in my own tests, especially with long contexts, it keeps continuing the text and repeating itself, and rarely answers the question completely. I'm curious what evaluation method the paper uses that lets a pretrained model handle long contexts this way. Best wishes!

This is odd; I wonder whether something went wrong somewhere?

```
expected_answer = "eat a sandwich and sit in Dolores Park on a sunny day.".lower().split()
model_response = "eat a sandwich and sit in Dolores Park on a sunny day.".lower()
score...
```
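The truncated snippet suggests a word-overlap score: the fraction of expected-answer words found in the model response. A minimal reconstruction of that scoring style is sketched below; the exact implementation in the repo may differ, and the function name is mine:

```python
def overlap_score(expected_answer: str, model_response: str) -> float:
    """Fraction of expected-answer words that appear in the response."""
    expected_words = expected_answer.lower().split()
    response = model_response.lower()
    hits = sum(1 for word in expected_words if word in response)
    return hits / len(expected_words)
```

Note that with this scheme a response identical to the expected answer scores 1.0, so a surprising score on the snippet's identical strings would indeed point to a bug elsewhere.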

Congratulations on the great work. I noticed that with the current HF code, the model collapses on very short input lengths. For example, I tried a text completion task...

Hi, I noticed you used dynamic NTK in llama-7b-80k. I'm curious when you applied it: before or after the training phase? Thank you for your reply.
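For readers unfamiliar with the term, dynamic NTK scaling enlarges the RoPE base on the fly once the input grows past the trained context length, rather than fixing it at training time; this matches the "dynamic" RoPE scaling scheme popularized in HF transformers. The sketch below is illustrative, and the scaling `factor` and function name are assumptions:

```python
def dynamic_ntk_base(base: float, dim: int, seq_len: int,
                     max_trained_len: int, factor: float = 2.0) -> float:
    """Scale the RoPE base when seq_len exceeds the trained context.

    Within the trained context the base is untouched; beyond it, the
    base grows with sequence length so positional rotations stretch.
    """
    if seq_len <= max_trained_len:
        return base  # no scaling needed inside the trained window
    scale = (factor * seq_len / max_trained_len) - (factor - 1)
    return base * scale ** (dim / (dim - 2))
```

Because the adjustment depends on the current input length, it is naturally an inference-time mechanism, which is presumably why the question of "before or after training" arises.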

![image](https://github.com/FranxYao/Long-Context-Data-Engineering/assets/34389681/afd5259e-f358-4bfd-8691-a08530961db6) This is the result we get with the code in this repo. We followed the README step by step, making sure the environment, model, and requirements are the same...

Hi! In your paper you mention:

```
We do not make any significant change to model architecture other than adjusting the base of RoPE, as in Xiong et al....
```
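As background on what "adjusting the base of RoPE" means: each rotary dimension rotates at frequency theta_i = base^(-2i/d), and long-context continued training typically raises `base` so that rotations at distant positions remain distinguishable. The sketch below only illustrates that relationship; the specific base values are placeholders, not the paper's:

```python
def rope_frequencies(dim: int, base: float):
    """Per-dimension RoPE rotation frequencies theta_i = base^(-2i/dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# Illustrative comparison: a larger base shrinks every frequency past
# the first, i.e. rotations advance more slowly per position.
short_ctx = rope_frequencies(128, 10_000.0)
long_ctx = rope_frequencies(128, 5_000_000.0)
assert all(l < s for l, s in zip(long_ctx[1:], short_ctx[1:]))
```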