phpspider
Paginated scraping: nothing I try works
http://www.qikan.com.cn/articleinfo/dinb20222801.html
http://www.qikan.com.cn/articleinfo/dinb20222801-1.html
These are the default detail page and its paginated page.
$configs = array(
    'name' => 'diannaobao',
    'log_show' => true,
    'max_fields' => 1, // collect at most 2 items per run
    'domains' => array(
        'www.qikan.com.cn'
    ),
    // entry point
    'scan_urls' => array(
        "http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/{$year}/{$week}.html" // e.g. http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/2022/27.html
    ),
    // content pages; this part also works
    'content_url_regexes' => array(
        "http://www.qikan.com.cn/article/\S+", // e.g. http://www.qikan.com.cn/article/dinb20222701.html
        // "http://www.qikan.com.cn/articleinfo/\s+"
    ),
    'fields' => array(
        array(
            'name' => "contents",
            'selector' => "//div[contains(@class,'art-pre')]//a//@href",
            // //*[@id="form1"]/div[6]/div/div[2]/div[1]/div[2]/a[5]
            'repeated' => true,
            'required' => true, // required field
            'children' => array(
                array(
                    // extract the URLs of the other pages for later use
                    'name' => 'content_page_url',
                    'selector' => "//text()"
                ),
                array(
                    // extract the content of the other pages
                    'name' => 'page_content',
                    // send a request to attached_url to fetch the data of the other pages
                    // attached_url uses the content_page_url captured above
                    'source_type' => 'attached_url',
                    'attached_url' => 'content_page_url', // 'attached_url'=>"https://www.zhihu.com/r/answers/{comment_id}/comments",
                    'selector' => "//div[contains(@class,'textWrap')]"
                ),
            ),
        ),
    ),
);
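The paginated pages do get scraped, but the content is all duplicated. I just don't understand what content_page_url is actually supposed to mean.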
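Going by the comments in the configuration, content_page_url is just the name of a child field that holds each URL matched by the parent selector, and attached_url refers to that field by name. One way to see what the parent selector hands over is to run the same XPath outside phpspider. The sketch below uses PHP's built-in DOMDocument and DOMXPath. The HTML is a hypothetical stand-in (the real markup on qikan.com.cn may differ), and the idea that repeated links cause the duplicates is an unverified guess, not a confirmed diagnosis.

```php
<?php
// Hypothetical sample markup; the real pagination block may look different.
$html = <<<'HTML'
<html><body>
<div class="art-pre">
  <a href="/articleinfo/dinb20222801.html">1</a>
  <a href="/articleinfo/dinb20222801-1.html">2</a>
  <a href="/articleinfo/dinb20222801-1.html">Next</a>
</div>
</body></html>
HTML;

// The page the links were found on.
$pageUrl = 'http://www.qikan.com.cn/articleinfo/dinb20222801.html';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Same expression as the 'contents' selector in the configuration.
$nodes = $xpath->query("//div[contains(@class,'art-pre')]//a//@href");

$pageUrls = array();
foreach ($nodes as $attr) {
    $href = trim($attr->nodeValue);
    // Simplified resolution: only root-relative links are handled here.
    if (strpos($href, '/') === 0) {
        $href = 'http://www.qikan.com.cn' . $href;
    }
    $pageUrls[] = $href;
}

// Drop the current page and any repeated links.
$uniqueUrls = array_values(array_unique(array_diff($pageUrls, array($pageUrl))));

print_r($pageUrls);   // every href the selector matched
print_r($uniqueUrls); // what is left after removing repeats
```

With this sample markup the selector yields three hrefs, but only one distinct page other than the current one. If the real pagination block behaves similarly, requesting every matched href would fetch the same page more than once.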
@owner888
Did you get this working? I haven't figured it out either; the content comes out duplicated for me too.
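It may also be worth checking which URLs the active content_url_regexes pattern accepts. This is a minimal check, assuming the pattern is applied as an ordinary PCRE match. The '#' delimiters and the 'i' flag are assumptions made for this sketch, not something shown in the thread.

```php
<?php
// Pattern copied from 'content_url_regexes' in the configuration above.
$regex = "http://www.qikan.com.cn/article/\S+";

$urls = array(
    'http://www.qikan.com.cn/article/dinb20222701.html',
    'http://www.qikan.com.cn/articleinfo/dinb20222801.html',
    'http://www.qikan.com.cn/articleinfo/dinb20222801-1.html',
);

$matches = array();
foreach ($urls as $url) {
    // Assumed delimiters: '#' is used because the pattern itself contains '/'.
    $matches[$url] = preg_match('#' . $regex . '#i', $url) === 1;
    echo $url, ' => ', ($matches[$url] ? 'match' : 'no match'), PHP_EOL;
}
```

Only the /article/ URL matches. The two /articleinfo/ URLs quoted at the top of the post do not, and the commented-out /articleinfo/ pattern ends in \s+ (whitespace) rather than \S+ (non-whitespace), so it would not match them either if it were enabled.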