
How do I write content_url_regexes?

Open czly opened this issue 7 years ago • 3 comments

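I want to scrape https://movie.douban.com/subject/1307793/. How should I write content_url_regexes? I want to exclude any URL that has something after the final /, for example https://movie.douban.com/subject/1307793/questions/ask/%3Ffrom%3Dsubject_top. Otherwise the crawl is inefficient and more likely to get banned.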

czly avatar May 13 '17 05:05 czly

If irrelevant URLs can't be filtered out, the collected data won't be accurate.

czly avatar May 13 '17 05:05 czly

2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/collections
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/questions/ask?from=subject_top
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/celebrities
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/all_photos
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/mupload
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1298038/?from=subject-page
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1304585/?from=subject-page

How can I exclude all of these? They slow the crawl down.

czly avatar May 13 '17 05:05 czly

Write it like this: https://movie.douban.com/subject/\d+/ and those URLs will no longer be collected as content pages.

owner888 avatar May 16 '17 03:05 owner888
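
This thread does not say how phpspider applies each content_url_regexes entry. If it is an unanchored search, as with PHP's preg_match, the pattern above would still match the longer URLs from the log, and adding ^ and $ would be one way to restrict it to bare subject pages. The sketch below is in Python rather than PHP and only illustrates that regex behavior; the anchored pattern, the variable names, and the helper function are assumptions for illustration, not part of phpspider.

```python
import re

# Pattern exactly as suggested in the reply (no anchors).
LOOSE = re.compile(r"https://movie.douban.com/subject/\d+/")

# Assumed stricter variant: dots escaped, anchored at both ends,
# so nothing may follow the final slash.
STRICT = re.compile(r"^https://movie\.douban\.com/subject/\d+/$")

# URLs taken from the question and the debug log in this thread.
urls = [
    "https://movie.douban.com/subject/1307793/",
    "https://movie.douban.com/subject/1292401/collections",
    "https://movie.douban.com/subject/1292401/questions/ask?from=subject_top",
    "https://movie.douban.com/subject/1292401/celebrities",
    "https://movie.douban.com/subject/1292401/all_photos",
    "https://movie.douban.com/subject/1292401/mupload",
    "https://movie.douban.com/subject/1298038/?from=subject-page",
    "https://movie.douban.com/subject/1304585/?from=subject-page",
]


def content_pages(pattern, candidates):
    """Return the candidates that an unanchored search with `pattern` accepts."""
    return [u for u in candidates if pattern.search(u)]


loose_hits = content_pages(LOOSE, urls)
strict_hits = content_pages(STRICT, urls)

print(len(loose_hits), "URLs match the unanchored pattern")
print(len(strict_hits), "URL(s) match the anchored pattern:", strict_hits)
```

Note that the anchored variant also rejects subject pages carrying a query string such as ?from=subject-page, which matches what the question asks for. If those should count as content pages, the pattern would need an optional query part before the $.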