phpspider: How should content_url_regexes be written?

Open czly opened this issue 7 years ago • 3 comments

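I want to scrape https://movie.douban.com/subject/1307793/. How should I write content_url_regexes for it? I want to exclude any URL that has something after the final /, for example https://movie.douban.com/subject/1307793/questions/ask/%3Ffrom%3Dsubject_top. Otherwise the crawl is inefficient and the crawler is more likely to get banned.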

May 13 '17 05:05 czly

If irrelevant URLs cannot be filtered out, the collected data will not be accurate.

May 13 '17 05:05 czly

2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/collections
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/questions/ask?from=subject_top
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/celebrities
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/all_photos
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/mupload
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1298038/?from=subject-page
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1304585/?from=subject-page

How can I exclude all of these? They slow the crawl down.

May 13 '17 05:05 czly

Write it like this: https://movie.douban.com/subject/\d+/. That way those URLs will not be collected as content pages.

May 16 '17 03:05 owner888
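
The thread does not say whether phpspider anchors each content_url_regexes entry to the whole URL. If matching is unanchored (an assumption here), the sub-pages from the debug log would still match, because each of them contains /subject/<digits>/ as a prefix. The sketch below uses Python's re module only to illustrate the regex behaviour; it is not phpspider code. The explicit ^ and $ anchors and the escaped dots are additions in this sketch, not part of the reply above.

```python
import re

# Pattern as written in the reply, used unanchored (dots left unescaped as in the reply).
UNANCHORED = re.compile(r"https://movie.douban.com/subject/\d+/")

# Hypothetical stricter variant: dots escaped, and ^...$ added so that
# nothing may follow the final slash. These anchors are an assumption.
ANCHORED = re.compile(r"^https://movie\.douban\.com/subject/\d+/$")

# The original target page plus URLs copied from the debug log in this thread.
urls = [
    "https://movie.douban.com/subject/1307793/",
    "https://movie.douban.com/subject/1292401/collections",
    "https://movie.douban.com/subject/1292401/questions/ask?from=subject_top",
    "https://movie.douban.com/subject/1292401/celebrities",
    "https://movie.douban.com/subject/1292401/all_photos",
    "https://movie.douban.com/subject/1292401/mupload",
    "https://movie.douban.com/subject/1298038/?from=subject-page",
]

# Unanchored search accepts a URL if the pattern appears anywhere in it.
unanchored_hits = [u for u in urls if UNANCHORED.search(u)]

# Anchored search accepts a URL only if the whole string fits the pattern.
content_pages = [u for u in urls if ANCHORED.search(u)]

print(len(unanchored_hits), "URLs match the unanchored pattern")
print("Anchored matches:", content_pages)
```

With the anchored variant, only the bare subject page is kept; the sub-pages and the ?from=subject-page duplicates are rejected. If phpspider already matches against the full URL, the pattern from the reply may be enough as it stands.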
