LooGLE
ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models
Does the benchmark include a Chinese evaluation set? It looks like everything is in English.
Hi, is there a bug in the get_pred() method in pred_gpt_models.py? def get_pred(model, data_instance, tokenizer, max_length, max_gen, prompt_format, device): ans, groundtruth = [], [] preds = {} raw_inputs = data_instance['input'] if data_instance['qa_pairs'] == 'none': preds['qa_pairs'] = data_instance['qa_pairs'] json_obj = {'input':...
Hi! I have read the code for open-source model evaluation. I noticed that, unlike some existing benchmarks such as LongBench or L-Eval, there is no prompt-customization part...
I see the paper uses a LLaMA model extended to 32k context length for the summarization evaluation, while the other models (LongLLaMA, GPT, etc.) have likely been instruction-tuned to some degree and already understand the corresponding tasks. I'm not sure whether the llama-32k you chose had its context extended purely as a base language model; if so, how do you ensure the comparison is fair? Have you considered also evaluating llama-chat and other LLaMA variants that are both instruction-tuned and length-extended?
Thank you for your outstanding work, but I encountered the following problem during testing: a single A100 (80 GB) runs out of memory when predicting with an overly long context. I am...
According to the PyTorch website, cu121 wheels are only available from torch 2.1.0 onward; torch 2.0.1 requires cu117 or cu118.
In Table 7, the scores on each long dependency QA sub-task are all below 50, yet the reported overall score is 54.09. Since both are scored by GPT-4, why don't they align?
For long dependency QA, I suppose the 'S' key under 'qa_pairs' should contain the annotated source text? But I found that most of the time the context in 'S' cannot...
Hello, and thank you for your paper. The paper mentions that human evaluation is one of the evaluation methods; which tasks were evaluated by humans?