jparser
A readability parser that extracts the title, content, and images from HTML pages
fix bugs
- make code compatible with Python 3
- cleaning and linting
- try/except fix around a TypeError in model.py
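As a rough illustration of the kind of change a Python 3 port usually involves, the sketch below shows an import fallback for URL joining. The helper name `absolutize` and the example URLs are hypothetical and not taken from jparser; the PR's actual changes are not shown here.

```python
# Sketch only: a common Python 2/3 compatibility pattern, not jparser's code.
try:
    from urllib.parse import urljoin  # Python 3 location
except ImportError:
    from urlparse import urljoin  # Python 2 location


def absolutize(base_url, src):
    """Resolve a possibly relative URL (e.g. an image src) against the page URL.

    `absolutize` is a hypothetical helper used for illustration.
    """
    return urljoin(base_url, src)


if __name__ == "__main__":
    print(absolutize("http://example.com/a/b.html", "img/c.png"))
```

Note that in Python 3 a bare `import urllib` does not by itself guarantee that `urllib.parse` is loaded, so importing the submodule explicitly is the safer form.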
A check needs to be added. In model.py, add: if not isinstance(t, str): continue ` import re import lxml import lxml.html import urllib.parse from .tags_util import clean_tags_only, clean_tags_hasprop, clean_tags_exactly, clean_tags from .region import Region class PageModel(object): def...
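The preview above is truncated, so it is not clear where `t` comes from in model.py. A minimal sketch of the proposed guard, assuming `t` is one item from a sequence of text fragments that may also contain non-string values, could look like this. The function name `collect_text` and its input are hypothetical.

```python
# Sketch only: illustrates the `isinstance` guard from the issue.
# `collect_text` and `fragments` are hypothetical names, not jparser's API.
def collect_text(fragments):
    """Join the string fragments, skipping anything that is not a str."""
    parts = []
    for t in fragments:
        # The guard proposed in the issue: ignore non-string items
        # (None, numbers, bytes, ...) instead of raising a TypeError later.
        if not isinstance(t, str):
            continue
        t = t.strip()
        if t:
            parts.append(t)
    return " ".join(parts)


if __name__ == "__main__":
    print(collect_text(["Hello ", None, 42, b"bytes", "  ", "world"]))
```

Skipping the item with `continue` keeps the loop running, whereas wrapping the whole loop in try/except would drop every fragment after the first bad one.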
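Removing noise data from web page body text
Hello, I have been working on something similar recently. The body text of a typical web page contains a lot of extraneous noise data that needs to be removed. Is there any plan to add this later?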
Change model.py to: ` #!/bin/env python #encoding=utf-8 import re import lxml import lxml.html import urllib from .tags_util import clean_tags_only, clean_tags_hasprop, clean_tags_exactly, clean_tags from .region import Region class PageModel(object): def __init__(self, page,...