LooGLE
ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models
Does the benchmark include a Chinese evaluation set? It looks like everything is in English.
Hi, is there a bug in the get_pred() method in pred_gpt_models.py? def get_pred(model, data_instance, tokenizer, max_length, max_gen, prompt_format, device): ans, groundtruth = [], [] preds = {} raw_inputs = data_instance['input'] if data_instance['qa_pairs'] == 'none': preds['qa_pairs'] = data_instance['qa_pairs'] json_obj = {'input':...
Hi! I have read the code for open-source model evaluation. I noticed that, unlike some existing benchmarks such as LongBench or L-Eval, there is no prompt-customization part...
I see the paper uses a LLaMA model extended to 32k context length for the summarization evaluation, while the other models (LongLLaMA, GPT, etc.) have likely been instruction-tuned to some degree and already understand the corresponding tasks. I'm not sure whether the llama-32k you chose had its context extended purely as a base language model; if so, how do you ensure the comparison is fair? Have you considered also evaluating llama-chat and other LLaMA variants that are both instruction-tuned and length-extended?
Thank you for your outstanding work, but I encountered the following problem during testing: a single A100 (80 GB) runs out of memory when predicting with an overly long context. I am...
According to the PyTorch website, cu121 wheels are only available from torch 2.1.0 onward; torch 2.0.1 requires cu117 or cu118.
In Table 7, the scores on each long dependency QA sub-task are all below 50, yet the reported overall score is 54.09. Since both are scored by GPT-4, why don't they align?
For long dependency QA, I suppose the 'S' key under 'qa_pairs' should contain the annotated source text? But I found that most of the time the context in 'S' cannot...
Hello, and thank you for your paper. The paper mentions that human evaluation is one of the evaluation methods; which tasks were evaluated by humans?