phpspider
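How should I write content_url_regexes?
I want to scrape https://movie.douban.com/subject/1307793/. How should I write content_url_regexes for it? I want to exclude any URL that has something after the final /, for example https://movie.douban.com/subject/1307793/questions/ask/%3Ffrom%3Dsubject_top, because otherwise the crawl is inefficient and more likely to get banned.
If unrelated URLs can't be filtered out, the collected data won't be accurate.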
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/collections
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/questions/ask?from=subject_top
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/celebrities
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/all_photos
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1292401/mupload
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1298038/?from=subject-page
2017-05-13 13:10:42 [debug] Find content page: https://movie.douban.com/subject/1304585/?from=subject-page
How can I exclude all of these? They slow the crawl down.
Write it like this: https://movie.douban.com/subject/\d+/ and those URLs will no longer be collected as content pages.
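One thing worth checking before relying on that pattern: if the crawler tests URLs with an unanchored "search" style match (an assumption here, not something confirmed in this thread), a pattern without an end anchor still matches longer URLs such as .../collections. The sketch below is plain Python, run outside phpspider, and compares the pattern from the reply with a hypothetical anchored variant; the added ^ and $, the escaped dots, and the helper name matching are illustrative assumptions.

```python
import re

# Pattern as given in the reply above.
UNANCHORED = r"https://movie.douban.com/subject/\d+/"
# Hypothetical anchored variant: "^", "$" and the escaped dots are
# assumptions added for illustration, not part of the reply.
ANCHORED = r"^https://movie\.douban\.com/subject/\d+/$"

# Sample URLs taken from the question and the debug log above.
urls = [
    "https://movie.douban.com/subject/1307793/",
    "https://movie.douban.com/subject/1292401/collections",
    "https://movie.douban.com/subject/1292401/questions/ask?from=subject_top",
    "https://movie.douban.com/subject/1298038/?from=subject-page",
]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches anywhere in the string."""
    return [u for u in candidates if re.search(pattern, u)]


unanchored_hits = matching(UNANCHORED, urls)
anchored_hits = matching(ANCHORED, urls)

print("unanchored:", unanchored_hits)
print("anchored:  ", anchored_hits)
```

Under search semantics the unanchored pattern accepts all four sample URLs, while the anchored one accepts only the bare subject page. Note that the anchored variant also rejects .../?from=subject-page; if those should count as content pages, the pattern would need an optional query-string part.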