
Pagination scraping: nothing I try works

Open kavt opened this issue 2 years ago • 2 comments

http://www.qikan.com.cn/articleinfo/dinb20222801.html

http://www.qikan.com.cn/articleinfo/dinb20222801-1.html

The two URLs above are the default detail page and its paginated page. My config:

$configs = array(
    'name' => 'diannaobao',
    'log_show' => true,
    'max_fields' => 1, // collect at most 2 items per run
    'domains' => array( 'www.qikan.com.cn' ),

// Entry point



'scan_urls' => array(
    "http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/{$year}/{$week}.html"   //  http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/2022/27.html
),





// Content pages: this part also works
 'content_url_regexes' => array(
        "http://www.qikan.com.cn/article/\S+",  //http://www.qikan.com.cn/article/dinb20222701.html
       // "http://www.qikan.com.cn/articleinfo/\s+"
    ),


'fields' => array(



    array(
        'name' => "contents",
        'selector' => "//div[contains(@class,'art-pre')]//a//@href", ////div[contains(@class,'art-pre')]//a//@href

        ////*[@id="form1"]/div[6]/div/div[2]/div[1]/div[2]/a[5]
        ////div[contains(@class,'art-pre')]//a//@href
        
        'repeated' => true,
        'required' => true, // required field

        'children' => array(

          
            array(
                // Extract the URLs of the other pages for later use
                'name' => 'content_page_url',
               
                'selector' => "//text()"
            ),

        
            array(
                // Extract the content of the other pages
                'name' => 'page_content',
               
                // Send a request to attached_url to fetch the data of the other pages
                // attached_url uses the content_page_url extracted above
                'source_type' => 'attached_url',
                'attached_url' => 'content_page_url',   // 'attached_url'=>"https://www.zhihu.com/r/answers/{comment_id}/comments",
                'selector' => "//div[contains(@class,'textWrap')]"
            ),
        ),
    ),

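The paginated pages do get scraped, but the content of every page is the same. I just don't understand what content_page_url is actually supposed to mean.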
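When every page comes back with the same content, one thing worth checking is what the `contents` selector actually returns, since those href values are what the child fields work from. The following is a minimal standalone sketch using PHP's built-in DOM extension, not phpspider's own implementation. The sample markup, the `extractPageUrls` helper, and the assumption that the hrefs are either absolute or root-relative are all hypothetical; the real page on qikan.com.cn may differ.

```php
<?php
// Hypothetical sample markup standing in for the pager block of a detail page.
$html = <<<HTML
<div class="art-pre">
  <a href="/articleinfo/dinb20222801.html">1</a>
  <a href="/articleinfo/dinb20222801-1.html">2</a>
  <a href="/articleinfo/dinb20222801-1.html">Next</a>
</div>
HTML;

// Hypothetical helper: run the same XPath as the 'contents' field and
// return the distinct page URLs it matches, resolved against $base.
function extractPageUrls(string $html, string $base): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; suppress parser warnings.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $urls = array();
    foreach ($xpath->query("//div[contains(@class,'art-pre')]//a//@href") as $attr) {
        $href = trim($attr->nodeValue);
        // Assumption: hrefs are either absolute or root-relative.
        if (strpos($href, 'http') !== 0) {
            $href = rtrim($base, '/') . '/' . ltrim($href, '/');
        }
        // Using the URL as an array key drops repeated links.
        $urls[$href] = true;
    }
    return array_keys($urls);
}

$pages = extractPageUrls($html, 'http://www.qikan.com.cn');
print_r($pages);
```

If the printed list contains repeated or relative URLs, that would be one plausible reason for the attached requests returning the same page each time, though this sketch cannot confirm how phpspider itself resolves them.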

kavt, Jul 31 '22 05:07

@owner888

kavt, Jul 31 '22 06:07

Did you get this working? I couldn't figure it out either; my content comes out duplicated too.

ishwy, Aug 11 '22 04:08