The error message is as follows:

    java.io.FileNotFoundException: data\pkubase\paraphrase\mini-mention2ent.txt (The system cannot find the file specified.)
    at java.base/java.io.FileInputStream.open0(Native Method)
    at java.base/java.io.FileInputStream.open(FileInputStream.java:213)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:155)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:110)
    at java.base/java.io.FileReader.<init>(FileReader.java:60)
    at utils.FileUtil.readFile(FileUtil.java:14)
    at qa.extract.EntityRecognitionCh.<init>(EntityRecognitionCh.java:125)
    at paradict.ParaphraseDictionary.addPredicateAsNLPattern(ParaphraseDictionary.java:250)
    at paradict.ParaphraseDictionary.<init>(ParaphraseDictionary.java:71)
    at qa.Globals.init(Globals.java:50)
    at application.GanswerHttp.main(GanswerHttp.java:71)

The files under the \data\pkubase\paraphrase directory are: +ccksminutf.txt +pkubase-mention2ent.txt...
### Deployed the jar package following the documentation, and "Server Ready!" appeared.

But when I send the request: http://ip:port/gSolve/?data={maxAnswerNum:3,%20maxSparqlNum:2,%20question:Who%20is%20the%20wife%20of%20Donald%20Trump?} the response is: {"question":"Who is the wife of Donald Trump?","vars":["?wife"],"sparql":["select DISTINCT ?wife where { \t\t?wife. } LIMIT 3"],"results":{"bindings":[{"?wife":{"type":"uri","value":""}},{"?wife":{"type":"uri","value":""}},{"?wife":{"type":"uri","value":""}}]},"status":"200"} The value fields in the response are empty, but from the backend log it looks like the query actually found results: ==========Group Simple Relations========= ========================================= Check query graph...
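In case it helps narrow things down, here is a minimal sketch of sending the same request from Python with the data payload explicitly JSON-encoded and URL-escaped (endpoint and parameter names are taken from the URL above; whether gSolve accepts strict JSON with quoted keys is an assumption on my part):

```python
import json
import requests

# "ip:port" is a placeholder, same as in the URL above.
payload = {"maxAnswerNum": 3, "maxSparqlNum": 2,
           "question": "Who is the wife of Donald Trump?"}

# requests URL-encodes the query string, so braces, quotes and spaces survive intact.
resp = requests.get("http://ip:port/gSolve/", params={"data": json.dumps(payload)})
print(resp.json())
```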
In backpropagation step (2), $dW^{[l]} = dZ^{[l]} \cdot (a^{[l-1]})^T$, i.e. $a^{[l-1]}$ needs to be transposed, otherwise the dimensions don't match up. Is this understanding correct? Please advise.
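For what it's worth, a quick numpy shape check under the usual column-wise batch convention (the layer sizes below are made up):

```python
import numpy as np

# Hypothetical sizes: batch of m examples, layer l-1 has n_prev units, layer l has n_l units.
m, n_prev, n_l = 4, 3, 2

A_prev = np.random.randn(n_prev, m)   # a[l-1], shape (n_{l-1}, m)
dZ = np.random.randn(n_l, m)          # dz[l],  shape (n_l, m)

# dW[l] = (1/m) * dZ[l] @ a[l-1].T ; the 1/m factor applies when averaging over the batch.
dW = (1.0 / m) * dZ @ A_prev.T        # (n_l, m) @ (m, n_{l-1}) -> (n_l, n_{l-1})
print(dW.shape)                       # (2, 3), the same shape as W[l]
```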
**Hi, Luca Weihs! I downloaded your source code from GitHub and ran it as described in README.md, but I got the error message below**: Traceback (most recent...
Hi all, I want to use rdflib to parse the DBpedia dump mappingbased_objects_en.ttl; the format is N-Triples and the file is roughly 2.4 GB.

    g=rdflib.Graph()
    g.parse(bz2.open(r"../data/mappingbased_objects_en.ttl.bz2"),format="nt")

Loading is extremely slow and consumes a lot of memory and CPU. Does anyone have a better solution?
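Not a full answer, but since N-Triples is line-oriented, one common workaround is to stream the compressed file and process triples line by line instead of materializing the whole 2.4 GB dump in an in-memory rdflib Graph. A rough sketch (the naive split is only safe when the object is a URI, which should hold for the mappingbased_objects dump; literals with spaces would need a real parser):

```python
import bz2

# Same file as above; adjust the path as needed.
path = "../data/mappingbased_objects_en.ttl.bz2"

count = 0
with bz2.open(path, mode="rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # N-Triples: one triple per line, "<s> <p> <o> ."
        s, p, o = line.rstrip(" .").split(" ", 2)
        count += 1
print(count, "triples scanned")
```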
Hello, regarding the padding of the input text in the build_dataset method of utils.py:

    def load_dataset(path, pad_size=32):
        contents = []
        with open(path, 'r', encoding='utf-8') as f:
            for line in tqdm(f):
                lin = line.strip()
                if not lin:
                    continue
                content, label = lin.split('\t')
                token =...
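To make the question concrete, here is a minimal sketch of the pad/truncate step I assume load_dataset applies right after tokenization (the helper name and the PAD/UNK symbols are my own, not taken from the file):

```python
PAD, UNK = '<PAD>', '<UNK>'

def pad_tokens(token, vocab, pad_size=32):
    """Pad or truncate a token list to pad_size and map tokens to ids."""
    seq_len = len(token)
    if seq_len < pad_size:
        token = token + [PAD] * (pad_size - seq_len)   # pad short sequences on the right
    else:
        token = token[:pad_size]                       # truncate long sequences
        seq_len = pad_size
    words_line = [vocab.get(t, vocab.get(UNK, 0)) for t in token]
    return words_line, seq_len
```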
Hello all, I'm trying to use the 13B model on a machine with two GPUs (NVIDIA Tesla V100s, 32GB) with the following command: $torchrun --nproc_per_node 2 example.py --ckpt_dir /path_to/llama/13B --tokenizer_path...
Question about merging vocabularies
A question for the experts: I trained a Chinese vocabulary myly.model with [sentencepiece](https://github.com/google/sentencepiece) on in-domain Chinese corpora. How do I merge it with LLaMA's original vocabulary tokenizer.model?
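Not an authoritative answer, but the approach I have seen in Chinese LLaMA extension projects is to merge at the SentencePiece protobuf level: take LLaMA's tokenizer.model and append every piece from the Chinese model that it does not already contain. A rough sketch (requires the protobuf package; the output file name and the score of 0 for new pieces are my own choices):

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# File names from the question; the merged output name is made up.
llama_sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
zh_sp = spm.SentencePieceProcessor(model_file="myly.model")

llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_sp.serialized_model_proto())
zh_proto = sp_pb2.ModelProto()
zh_proto.ParseFromString(zh_sp.serialized_model_proto())

# Append only the pieces LLaMA's vocabulary does not already have.
existing = {p.piece for p in llama_proto.pieces}
for p in zh_proto.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0          # naive choice: all added pieces get score 0
        llama_proto.pieces.append(new_piece)

print("merged vocab size:", len(llama_proto.pieces))
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```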
Hello, could you release the code for the incremental continued pre-training of LLaMA on Chinese corpora?
During incremental pre-training, I counted roughly 88873773 training samples, while instances_buffer_size defaults to 25600. In the Dataloader class's _fill_buf method: **_if len(self.buffer) >= self.instances_buffer_size: break_** My understanding is that instances_buffer_size would have to be 88873773 for all training samples to be traversed, but setting it that high would presumably blow up memory. If that is the case, is there a way to guarantee that every sample is visited? I'm not sure whether my understanding is correct; please advise.
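My (unverified) reading is that _fill_buf is meant to be called repeatedly, so the buffer gets refilled as batches are consumed; in that case instances_buffer_size only bounds memory and the shuffle window, not how many samples are ever visited. For illustration, a generic shuffle-buffer pattern that yields every sample exactly once without holding them all in memory:

```python
import random

def buffered_shuffle(samples, buffer_size=25600):
    """Yield every sample exactly once while shuffling within a bounded buffer.

    The buffer is refilled as items are drawn, so buffer_size only limits memory
    and the shuffle window; it does not have to equal the dataset size.
    """
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            idx = random.randrange(len(buffer))
            yield buffer.pop(idx)
    random.shuffle(buffer)   # drain whatever remains once the input is exhausted
    yield from buffer
```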