
Results: 7 issues matching "bubble"

Hello, I found a spot in the Chapter 7 code that can be optimized. In the tokenizer function, padding='max_length' can be removed, since it wastes compute. When constructing the Trainer, the data_collator argument in transformers defaults to dynamic padding, which pads per batch and saves compute. On my CPU the run time went from 36 hours to 2 hours (I didn't run it to completion; that figure is the progress bar's estimate).
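For context, a minimal sketch of the suggested change, assuming a typical `transformers` fine-tuning setup (the checkpoint, dataset, and hyperparameters below are illustrative placeholders, not the book's exact code):

```python
# Sketch: dynamic per-batch padding instead of padding="max_length".
# Checkpoint and dataset are placeholders for the Chapter 7 setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(batch):
    # No padding="max_length": sequences keep their natural length, and the
    # collator pads each batch only up to that batch's longest example.
    return tokenizer(batch["text"], truncation=True)

dataset = load_dataset("imdb").map(tokenize_fn, batched=True)  # placeholder dataset

trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2),
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    # Trainer falls back to DataCollatorWithPadding automatically when a
    # tokenizer is supplied; passing it explicitly just makes that visible.
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    tokenizer=tokenizer,
)
# trainer.train()
```

Padding to the global max_length makes every batch as wide as the longest allowed sequence, so much of the attention computation is spent on pad tokens; per-batch padding avoids that, which is where the 36 h → 2 h estimate comes from.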

I noticed the annotation ```python # If you're using Seqio, we suggest caching your mixture as they take a while to generate. ``` but I don't know how to do this...
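For anyone else stuck here, a sketch of the usual seqio offline-caching workflow. Everything named `my_*` below, the vocabulary path, and the cache directory are hypothetical placeholders, not names from this repo:

```python
# Hypothetical seqio task with an offline-cache placeholder.
import seqio

vocab = seqio.SentencePieceVocabulary("/path/to/spm.model")  # placeholder path

seqio.TaskRegistry.add(
    "my_task",  # placeholder task name
    source=seqio.TfdsDataSource(tfds_name="my_dataset:1.0.0"),  # placeholder
    preprocessors=[
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),  # everything above it is cached offline
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab),
        "targets": seqio.Feature(vocabulary=vocab),
    },
)

# Cache the task once, offline (shell command; runs a Beam pipeline):
#   python -m seqio.scripts.cache_tasks_main \
#       --module_import=my_task_module \
#       --tasks=my_task \
#       --output_cache_dir=/path/to/cache_dir

# Afterwards, point seqio at the cache and request the cached version:
seqio.add_global_cache_dirs(["/path/to/cache_dir"])
ds = seqio.get_mixture_or_task("my_task").get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="train",
    use_cached=True,
)
```

The same recipe applies to a mixture: cache each member task, then load the mixture with `use_cached=True`.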

In [make_data_example.py](https://github.com/ssbuild/chatglm_finetuning/blob/dev/data/make_data_example.py), the data is repeated 100 times. Is the intent to study how much data it takes for the model to memorize this example? Are there any conclusions on that?

Also, regarding these two datasets:

| File | Count | Description |
| -- | -- | -- |
| judical_examination.json | 2,000 | ChatGPT-generated answers to judicial examination questions |
| judical_examination_v2.json | 5,000 | ChatGPT-generated answers to judicial examination questions (second public batch) |

Were they constructed with the JE-Q2EA method or the JE-QA2E method? Was there any other processing?

The tick ✔️ means that the corresponding corpus/dataset was used at the previous stage, while the flower 🌸 means the corpus/dataset is employed for training at the current stage...

What is the inference and extraction procedure for HumanEval? My test score is only 20.73 on HumanEval (k=1, model = CodeLlama-7b-Instruct-hf + DPO).
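In case it helps to compare, here is a sketch of a common HumanEval inference-and-extraction pipeline. This is not necessarily this repo's actual procedure, and `generate` is a hypothetical stand-in for the model call:

```python
# Sketch: extract code from chat-style outputs and score with OpenAI's
# human-eval harness (pip install human-eval). Not this repo's exact pipeline.
import re
from human_eval.data import read_problems, write_jsonl

STOP_SEQS = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]

def extract_completion(raw: str) -> str:
    # Instruct models usually wrap the answer in a ```python fence; take the
    # fenced block if present, otherwise treat the whole output as code.
    m = re.search(r"```(?:python)?\n(.*?)```", raw, re.DOTALL)
    code = m.group(1) if m else raw
    # Truncate at common stop sequences so only one function body survives.
    for stop in STOP_SEQS:
        idx = code.find(stop)
        if idx != -1:
            code = code[:idx]
    # Caveat: chat models often repeat the full "def ..." signature; since the
    # harness concatenates prompt + completion, you may also need to strip the
    # repeated prompt prefix here.
    return code

problems = read_problems()
samples = [
    {"task_id": task_id,
     "completion": extract_completion(generate(problem["prompt"]))}  # generate() is hypothetical
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Score pass@1 with the harness:
#   evaluate_functional_correctness samples.jsonl
```

A score in the low 20s at k=1 may point to an extraction mismatch (e.g. the repeated-signature caveat above) rather than the model itself, so the exact extraction step matters.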

What prompt is used for the code-repair capability evaluation?