cx-extractor: a time-consuming step


Open: GoogleCodeExporter opened this issue 9 years ago • 1 comment

source = links.matcher(source).replaceAll("");

Example page: http://news.itxinwen.com/2013/0802/515691.shtml

This step alone takes more than 90 seconds on that page.
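To check where the time goes, the step can be timed in isolation. This is only a rough sketch: the class name StepTimer and the method timeReplaceMillis are made up for illustration, and the pattern used in main is a stand-in, not the library's actual links pattern.

```java
import java.util.regex.Pattern;

class StepTimer {
    // Runs one replaceAll pass and returns the elapsed wall-clock time in milliseconds.
    static long timeReplaceMillis(Pattern pattern, String source) {
        long start = System.nanoTime();
        pattern.matcher(source).replaceAll("");
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Stand-in pattern and input; substitute the real pattern and the downloaded page.
        Pattern tags = Pattern.compile("<[^>]+>");
        String html = "<p>hello <a href=\"x\">world</a></p>";
        System.out.println(timeReplaceMillis(tags, html) + " ms");
    }
}
```

A single run like this is noisy, so it is only good for spotting a step that takes seconds rather than milliseconds.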

Suggestion: could all tags simply be removed with source = source.replaceAll("<[^>]+>", ""); instead?
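The suggestion can be sketched as follows. The class name TagStripper and the method stripTags are hypothetical; the only thing taken from the issue is the regex itself. The pattern is compiled once, because String.replaceAll recompiles the regex on every call.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Regex from the suggestion above: "<", then one or more non-">" characters, then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Removes every tag and keeps the text between tags.
    static String stripTags(String source) {
        return TAG.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Read <a href=\"/more\">more</a> here.</p>";
        System.out.println(stripTags(html)); // prints "Read more here."
    }
}
```

Note that this keeps the link text and drops only the markup, so it behaves differently from a pattern that deletes whole links. It also leaves the contents of script and style elements in place, and a ">" inside an attribute value would end a match early.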

Original issue reported on code.google.com by [email protected] on 2 Aug 2013 at 8:01

Mar 08 '16 08:03 GoogleCodeExporter

private static Pattern links = Pattern.compile("<[^>]+>.*?</[aA]>");

Given markup like <a>contents</a>, this pattern would be better.

The only drawback is that any passage in the body text that contains a hyperlink will be deleted as well.
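The drawback is easy to reproduce with the pattern from this comment. In the sketch below, the class name LinkRemovalDemo and the sample strings are invented for illustration. It also shows a side effect worth noting: the pattern starts at any tag, not only at an <a ...> tag, so a match can begin at an earlier tag on the same line and swallow the text in front of the link too.

```java
import java.util.regex.Pattern;

class LinkRemovalDemo {
    // Pattern quoted in the comment above: any tag, then the shortest run up to a closing </a>.
    static final Pattern LINKS = Pattern.compile("<[^>]+>.*?</[aA]>");

    static String removeLinks(String source) {
        return LINKS.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // The link text inside a sentence is deleted together with the markup.
        String inline = "See the <a href=\"/report\">full report</a> for details.";
        System.out.println("[" + removeLinks(inline) + "]");
        // prints "[See the  for details.]"

        // The match starts at <p>, so the words before the link are lost as well.
        String wrapped = "<p>See the <a href=\"/report\">full report</a> for details.</p>";
        System.out.println("[" + removeLinks(wrapped) + "]");
        // prints "[ for details.</p>]"
    }
}
```

Because "." does not match line terminators by default, a link whose opening and closing tags sit on different lines is not matched by this pattern at all.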

Original comment by [email protected] on 2 Aug 2013 at 9:57

Mar 08 '16 08:03 GoogleCodeExporter
