phpspider

Bug in the attached_url handling

Open yingzheng1980 opened this issue 2 years ago • 5 comments

At line 2114 of phpspider.php, the download should use `$collect_url`:

`$html = requests::$method($collect_url, $params);`

Otherwise the `attached_url` is never downloaded.
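As a sketch of the reported fix (the surrounding variable names are assumptions based on this report, not the verbatim phpspider source):

```php
<?php
// Sketch of the area around phpspider.php line 2114 (assumed context).
// Before (assumed): the request reused the page's own $url, so fields with
// 'source_type' => 'attached_url' re-fetched the current page instead of
// the attached one:
//     $html = requests::$method($url, $params);

// After: use the URL that was built for the attached_url field:
$html = requests::$method($collect_url, $params);
```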

yingzheng1980 · Jun 07 '22

> At line 2114 of phpspider.php, the download should use `$collect_url`: `$html = requests::$method($collect_url, $params);`
>
> Otherwise the `attached_url` is never downloaded.

Make the change and send me a patch, then.

owner888 · Jun 14 '22

Brother!!! Thank you so much!!

I searched for ages and could never figure out why it loaded but would not download anything except the main page. It turns out the code had a bug! @owner888, you are really doing people harm here!! Your code saves us a lot of time, but you could at least test it. I spent three days and nights without finding the cause.

kavt · Jul 31 '22

@yingzheng1980 One more question: some detail pages have pagination and some don't. How do I handle both cases?

```php
'fields' => array(
    array(
        'name' => "contents",
        // alternative tried: //div[contains(@class,'art-pre')]//a//@href
        'selector' => "//div[contains(@class,'art-pre')]/a/@href",
        'repeated' => true,
        'required' => true, // required field
        'children' => array(
            array(
                // Extract the URLs of the other pages for later use
                'name' => 'content_page_url',
                'selector' => "//text()"
            ),
            array(
                // Extract the content of the other pages
                'name' => 'page_content',
                'source_type' => 'attached_url',
                'attached_url' => 'content_page_url',
                'selector' => "//div[contains(@class,'textWrap')]"
            ),
        ),
    ),
),
```

kavt · Jul 31 '22

http://www.qikan.com.cn/articleinfo/dinb20222801-2.html

kavt · Jul 31 '22

Even with the change it still isn't right:

1. Pages without bottom pagination are silently skipped: `'selector' => "//div[contains(@class,'art-pre')]/a/@href"` matches nothing when a page has no "next page" links at the bottom, so nothing gets collected for those pages.

2. On pages that do have pagination, only the other pages (next page, etc.) are collected; the current page's own content is not collected.
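One way to address both problems (a sketch based on my reading of phpspider's config semantics, not a tested fix): make the pagination field optional so pages without a pager are not dropped, and extract the current page's body as its own field, then concatenate everything in `on_extract_field`. The field name `first_page_content` is my own invention:

```php
'fields' => array(
    // Current page body: present whether or not there is pagination
    array(
        'name' => "first_page_content",
        'selector' => "//div[contains(@class,'textWrap')]",
        'required' => true
    ),
    // Pagination links: optional, so pages without a bottom pager are kept
    array(
        'name' => "contents",
        'selector' => "//div[contains(@class,'art-pre')]/a/@href",
        'repeated' => true,
        'required' => false,   // was true; with true, pager-less pages are discarded
        'children' => array(
            array(
                'name' => 'content_page_url',
                'selector' => "//text()"
            ),
            array(
                'name' => 'page_content',
                'source_type' => 'attached_url',
                'attached_url' => 'content_page_url',
                'selector' => "//div[contains(@class,'textWrap')]"
            ),
        ),
    ),
),
```

The `on_extract_field` callback would then prepend `first_page_content` before the concatenated `page_content` parts, so the current page is included in the output.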

```php
<?php
require_once __DIR__ . '/../autoloader.php';

use phpspider\core\phpspider;
use phpspider\core\requests;
use phpspider\core\selector;

/* Do NOT delete this comment */

// [Important: simulated login]
$cookies = "ASP.NET_SessionId=uqbyzahwaa5fedgldcawsogx; Hm_lvt_782a719ae16424b0c7041b078eb9804a=1657892367,1658402814,1658581663,1658932747; Hm_lvt_29f14b13cac2f8b4e5fc964806f3ea52=1657892367,1658402820,1658581663,1658932747; Hm_lpvt_782a719ae16424b0c7041b078eb9804a=1658932755; Hm_lpvt_29f14b13cac2f8b4e5fc964806f3ea52=1658932755; UserToken=nrbjZ+ZFD3ulIoEX50957cwO1CrVaO5/NLAFj6bcy1Gx6rsh; LoginUserName=kavt12; LoginPassword=NRW/PSbsXFo=";

requests::set_cookies($cookies, 'www.qikan.com.cn');

// [For scan_urls: compute the current year and week number and append them to each request]
$year = date('Y');
$week = date('W'); // The magazine usually comes out Monday afternoon; subtract 2 weeks: $week = $week - 2;

//die;

// Attempt 7: mainly trying to add pagination
$configs = array(
    'name' => 'diannaobao',
    'log_show' => true,
    'max_fields' => 2, // collect at most 2 items per run
    'domains' => array(
        'www.qikan.com.cn'
    ),

    // entry point
    'scan_urls' => array(
        "http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/{$year}/{$week}.html"
        // e.g. http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/2022/27.html
    ),

    // content pages (these work)
    'content_url_regexes' => array(
        "http://www.qikan.com.cn/article/[\s\S]+",     // e.g. http://www.qikan.com.cn/article/dinb20222701.html
        "http://www.qikan.com.cn/articleinfo/[\s\S]+"
    ),

    'fields' => array(
        array(
            'name' => "contents",
            //'selector_type' => 'regex',
            // alternative tried: //div[contains(@class,'art-pre')]//a//@href
            'selector' => "//div[contains(@class,'art-pre')]/a/@href",
            'repeated' => true,
            'required' => true, // required field
            'children' => array(
                array(
                    // Extract the URLs of the other pages for later use
                    'name' => 'content_page_url',
                    'selector' => "//text()"
                ),
                array(
                    // Extract the content of the other pages
                    'name' => 'page_content',
                    'source_type' => 'attached_url',
                    'attached_url' => 'content_page_url',
                    'selector' => "//div[contains(@class,'textWrap')]"
                ),
            ),
        ),

        // Extract the article title from the content page
        array(
            'name' => "title",
            'selector' => "//div[contains(@class,'article')]//h1",
            'required' => true
        ),

        // Body text (earlier attempt, kept for reference):
        /*
        array(
            'name' => "text",
            'selector' => "//div[contains(@class,'textWrap')]",
        ),
        */

        /*
        // Earlier pagination attempt, kept for reference:
        array(
            'name' => "contents",
            'selector' => "//html",
            'repeated' => true,
            'children' => array(
                array(
                    // Extract the URLs of the other pages for later use
                    'name' => 'content_page_url',
                    'selector' => "div[contains(@class,'art-pre')]//a//@href"
                ),
                array(
                    // Fetch the other pages via an attached_url request;
                    // attached_url interpolates the content_page_url captured above
                    'name' => 'page_content',
                    'source_type' => 'attached_url',
                    'attached_url' => 'http://www.qikan.com.cn/{content_page_url}',
                    'selector' => "//div[contains(@class,'textWrap')]"
                )
            )
        ),
        */

        // Images
        array(
            'name' => "pic",
            'selector' => "//figure[contains(@class,'image')]//img",
            // Returns an array of images; we may need to pick one out, e.g. $data = $data[0],
            // or show the first when there are several. So far leaving it unprocessed seems fine.
        ),
    ),

    'export' => array(
        'type'  => 'sql',
        'file'  => './data/8.sql',
        'table' => '数据表', // placeholder: target table name
    ),
);

$spider = new phpspider($configs);

// [How to post-process a collected field?] Use on_extract_field.
$spider->on_extract_field = function($fieldname, $data, $page) {
    if ($fieldname == 'contents') {
        $contents = $data;
        $data = "";

        $num = count($contents) - 1; // note: this skips the last element

        for ($i = 0; $i < $num; $i++) {
            $data .= $contents[$i]['page_content'];
        }

        /*
        foreach ($contents as $content)
        {
            $data .= $content['page_content'];
        }
        */
    }
    return $data;
};

$spider->start();
```

kavt · Jul 31 '22