GerapyAutoExtractor issues

Optimize the text length counting algorithm to handle English paragraphs

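To address [issue 22](https://github.com/Gerapy/GerapyAutoExtractor/issues/22), this change optimizes the text length counting algorithm. It targets Chinese web pages, including Chinese pages that contain English paragraphs. If `number of English characters / len(text) > 0.5` for a piece of text, the text is treated as mainly English and is counted by number of words rather than number of characters, which in turn corrects the "text density" metric (0.5 is an empirical value). Otherwise the original counting logic applies.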
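The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the function name `count_text_units`, the letter-matching regular expressions, and the assumption that the original logic counts non-whitespace characters are all hypothetical.

```python
import re

# Empirical threshold quoted in the description above.
ENGLISH_RATIO_THRESHOLD = 0.5


def count_text_units(text: str) -> int:
    """Count words for mostly-English text, characters otherwise.

    Hypothetical helper for illustration only.
    """
    if not text:
        return 0
    # Number of ASCII letters in the text.
    english_chars = len(re.findall(r"[A-Za-z]", text))
    if english_chars / len(text) > ENGLISH_RATIO_THRESHOLD:
        # Mostly English: each run of letters counts as one word.
        return len(re.findall(r"[A-Za-z]+", text))
    # Assumed original behavior: count characters, ignoring whitespace.
    return len(re.sub(r"\s+", "", text))


print(count_text_units("hello world"))  # 2 (words, not 10 letters)
print(count_text_units("故宫低调点"))    # 5 (characters)
```

The letter-run regex is deliberately crude (for example, it splits "don't" into two words); a real implementation would need to decide how to treat digits, punctuation, and mixed-language text.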

yjshi2015

English paragraphs in a Chinese detail page reduce extraction accuracy

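**Description** I used the [latest page](https://news.ifeng.com/c/7kQcQG2peWU) from "故宫低调点" (see the attachment at the end). The extracted result is the "特别声明" (special statement) section rather than the actual article content.

![detail_extract](https://user-images.githubusercontent.com/27291507/176440668-401ed121-60ab-4dcf-92b7-cfb8f9cc9622.png)

**Cause** That section is mostly English, so its "text density" is much higher than that of the Chinese-text nodes. English text is counted **by character, not by word**: "hello world" counts as 10, not 2. This gives English a clear length advantage over Chinese, so the "text density" metric is skewed, which in turn affects the node's final score. The data is shown below:

![img](https://user-images.githubusercontent.com/27291507/176443114-72ea1593-f0cf-43a7-84c1-6b9f29df4405.png)

**Proposal** If the page is mainly Chinese, English paragraphs should be counted under the same standard as Chinese, that is, by **word count** rather than **character count**. I changed the two methods number_of_char and number_of_a_char along these lines and got the expected result:

![img_1](https://user-images.githubusercontent.com/27291507/176445888-8746880b-10eb-4193-8977-7d8bf96f1a9c.png)

**Attachment** Page source; rename the extension to html. [gugong_detail.txt](https://github.com/Gerapy/GerapyAutoExtractor/files/9010893/gugong_detail.txt)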

yjshi2015

bug

Detail page extraction returns only one paragraph

1

Test HTML: [cqie.html.zip](https://github.com/Gerapy/GerapyAutoExtractor/files/7729345/cqie.html.zip)

Germey

bug

numpy version problem

1

Starting with version 1.18, numpy officially retired the numpy.testing.decorators family of module names (note: only the module names were retired; their contents still exist). It also helpfully tells users that, to use what was in numpy.testing.decorators (the namespace), `import numpy.testing` is enough: from that version on, testing automatically includes the former testing.decorators and similar submodules. Could the author update the affected packages? I am using numpy 1.19.5 and ran into the problem below:

    D:\fsy\Anaconda\python.exe D:/python_study/IntelligentAnalysis/extract.py
    Traceback (most recent call last):
      File "D:/python_study/IntelligentAnalysis/extract.py", line 1, in <module>
        from gerapy_auto_extractor import extract_detail
      File "D:\fsy\Anaconda\lib\site-packages\gerapy_auto_extractor\__init__.py", line 4, in <module>
        from gerapy_auto_extractor.classifiers.list import is_list, probability_of_list...
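A quick way to confirm whether a given environment still provides a module path such as `numpy.testing.decorators` is to ask the import system without importing the module itself. The helper name `has_module` below is hypothetical; only the standard library is used, and the sketch does not attempt to fix the dependency that triggers the traceback.

```python
import importlib.util


def has_module(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located by the import system."""
    try:
        # find_spec imports parent packages, so a missing parent raises
        # ModuleNotFoundError (a subclass of ImportError).
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        return False


# Result depends on the installed numpy version (False if numpy is absent).
print(has_module("numpy.testing.decorators"))
```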

Smawexi

enhancement

Bug of Gerapy Auto Extractor: problem during installation

3

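The error output is below. It used to run fine, but now it always gets stuck at "Building wheels". I tried three computers and hit the same problem on all of them. Could you explain the cause of the error and how to fix it? Stack Overflow says installing with conda solves it, but I tried that and it still fails. Thank you very much.

    Building wheels for collected packages: lxml, numpy
      Building wheel for lxml (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: 'C:\Users\SinoCBD\AppData\Local\Programs\Python\Python38\python.exe' -u...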

dota-player

bug

Error: AttributeError: 'backports.zoneinfo.ZoneInfo' object has no attribute 'localize'

```
import requests
from gerapy_auto_extractor import extract_detail

path = "logs/生成过内容的记录表单.csv"  # path where the file is saved
WebSite = "sohu.com"  # site home page
Url = "https://www.sohu.com/a/419892530_603537"  # URL of the article
Title = extract_detail(requests.get(Url).text)["title"]  # article title
```

This raises AttributeError: 'backports.zoneinfo.ZoneInfo' object has no attribute 'localize'
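The error message suggests that some code is calling the pytz-style `localize()` method on a zoneinfo-style timezone object, which does not have it. The report does not say which dependency makes that call, so the following is only a sketch of the general difference: pytz timezones attach themselves via `tz.localize(dt)`, while standard `tzinfo` objects are attached with `dt.replace(tzinfo=tz)`. The helper `attach_timezone` is hypothetical, and a fixed-offset `timezone` stands in for a `ZoneInfo` instance so the example needs no timezone database.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def attach_timezone(naive: datetime, tz: tzinfo) -> datetime:
    """Make a naive datetime aware, supporting both timezone styles."""
    localize = getattr(tz, "localize", None)
    if callable(localize):
        # pytz-style timezone objects expose localize().
        return localize(naive)
    # zoneinfo-style (and other standard tzinfo) objects do not.
    return naive.replace(tzinfo=tz)


cst = timezone(timedelta(hours=8))  # stand-in for a ZoneInfo object
aware = attach_timezone(datetime(2020, 9, 21, 12, 0), cst)
print(aware.isoformat())  # 2020-09-21T12:00:00+08:00
```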

gclsoft

bug

https://www.econ.sdu.edu.cn/zxzx/tzgg.htm: can list pages like this, with category links, be extracted automatically?

1

On sites like https://www.econ.sdu.edu.cn/zxzx/tzgg.htm there are 2 links, which makes the result come back empty. Could you update the library to handle this?

ieliwb

enhancement

Please keep this project updated

2

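https://github.com/kingname/GeneralNewsExtractor/ next door is updated often, while this project has had no updates. I compared the source code, and this one is smarter and more foolproof to use!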

ieliwb

enhancement

Bug in the function preprocess4content_extractor

In the function preprocess4content_extractor, `for child in children(element):` only iterates over the direct children rather than all nodes. Should it be changed to `for descendant in element.iterdescendants():`?
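The difference between visiting direct children and visiting all descendants can be seen in a small tree. This sketch uses the standard-library `xml.etree.ElementTree` as a stand-in for lxml so it runs without extra packages; in lxml the descendant walk would be `element.iterdescendants()`. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><p>a<span>b</span></p><ul><li>c</li></ul></div>")

# Iterating an element yields only its direct children.
child_tags = [child.tag for child in root]

# iter() walks the whole subtree in document order and includes the
# element itself, so the root is skipped to mimic iterdescendants().
descendant_tags = [node.tag for node in root.iter() if node is not root]

print(child_tags)       # ['p', 'ul']
print(descendant_tags)  # ['p', 'span', 'ul', 'li']
```

If nodes are removed during the walk, it is safer to take a snapshot first (for example `list(root.iter())`), since modifying a tree while iterating over it can skip nodes.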

zhutuo

bug

can't remove element

This function does not work.

    def remove_element(element: Element):
        """
        remove child element from parent
        :param element:
        :return:
        """
        if element is None:
            return
        p = element.getparent()
        if p is not None:
            p.remove(element)

zhutuo

bug
