
Is crawling content-page URLs directly supported?

Open p0h5 opened this issue 7 years ago • 3 comments

Is it possible to generate the content-page URLs in bulk ahead of time and then add them to the crawl queue?

p0h5 avatar Apr 20 '17 02:04 p0h5

Yes, this is supported. First set up the content-page rule, for example:

    'content_url_regexes' => array(
        "http://www.mafengwo.cn/i/\d+.html",
    ),

Then generate the content-page URLs in bulk inside on_scan_page:

    $spider->on_scan_page = function($page, $content, $phpspider)
    {
        for ($i = 0; $i < 1000; $i++)
        {
            $url = "http://www.mafengwo.cn/i/{$i}.html";
            $phpspider->add_url($url);
        }
    };
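The generated URLs only end up in the queue if they match the content-page rule, so it can help to check the pattern against a few of them first. The snippet below is a rough sketch, not phpspider code: `matches_any` is a hypothetical helper, and wrapping the pattern in `#...#i` delimiters is an assumption about how the rule is applied.

```php
<?php
// Hypothetical helper, not part of phpspider: returns true if $url matches
// any of the given patterns. The #...#i delimiters are an assumption made
// for this sketch.
function matches_any(array $regexes, string $url): bool
{
    foreach ($regexes as $regex)
    {
        if (preg_match("#{$regex}#i", $url) === 1)
        {
            return true;
        }
    }
    return false;
}

$content_url_regexes = array(
    "http://www.mafengwo.cn/i/\d+.html",
);

// Generate a handful of candidate URLs the same way as the loop above.
$generated = array();
for ($i = 0; $i < 5; $i++)
{
    $generated[] = "http://www.mafengwo.cn/i/{$i}.html";
}

// Keep only the URLs that the content-page rule accepts.
$accepted = array_values(array_filter(
    $generated,
    function ($url) use ($content_url_regexes)
    {
        return matches_any($content_url_regexes, $url);
    }
));

echo count($accepted), " of ", count($generated), " generated URLs match the content rule\n";
```

A URL that does not fit the pattern, such as one under `/u/` instead of `/i/`, is rejected by this check.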

owner888 avatar Apr 27 '17 09:04 owner888

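What if the content pages are not linked from the entry page or any list page? I just want to generate the content-page URLs in bulk and have the spider crawl them one by one.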

p0h5 avatar Apr 28 '17 06:04 p0h5

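A small change to the add_url function is enough. Add a new $force_content parameter and set it to true when calling add_url. The content-page and list-page rules can both be left empty.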

    public function add_url($url, $options = array(), $depth = 0, $force_content = false)
    {
        // Whether the link was pushed to the queue
        $status = false;

        $link = $options;
        $link['url'] = $url;
        $link['depth'] = $depth;
        $link = $this->link_uncompress($link);

        if ($this->is_list_page($url))
        {
            $link['url_type'] = 'list_page';
            $status = $this->queue_lpush($link);
        }

        if ($this->is_content_page($url) || $force_content)
        {
            $link['url_type'] = 'content_page';
            $status = $this->queue_lpush($link);
        }

        if ($status)
        {
            if ($link['url_type'] == 'scan_page')
            {
                log::debug("Find scan page: {$url}");
            }
            elseif ($link['url_type'] == 'list_page')
            {
                log::debug("Find list page: {$url}");
            }
            elseif ($link['url_type'] == 'content_page')
            {
                log::debug("Find content page: {$url}");
            }
        }

        return $status;
    }
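To see the effect of the flag in isolation, here is a minimal sketch that runs without phpspider. `MiniSpider` is a hypothetical stand-in written only for this example: it copies the branching of the patched add_url, but swaps the real queue for an in-memory array and makes both rule checks return false, which is an assumption standing in for empty list and content rules.

```php
<?php
// Hypothetical stand-in for the spider class, written only for this sketch.
class MiniSpider
{
    // In-memory replacement for the real queue.
    public $queue = array();

    // Both rule sets are assumed empty, so neither check ever matches.
    private function is_list_page($url)
    {
        return false;
    }

    private function is_content_page($url)
    {
        return false;
    }

    public function add_url($url, $options = array(), $depth = 0, $force_content = false)
    {
        $status = false;

        $link = $options;
        $link['url'] = $url;
        $link['depth'] = $depth;

        if ($this->is_list_page($url))
        {
            $link['url_type'] = 'list_page';
            $this->queue[] = $link;
            $status = true;
        }

        if ($this->is_content_page($url) || $force_content)
        {
            $link['url_type'] = 'content_page';
            $this->queue[] = $link;
            $status = true;
        }

        return $status;
    }
}

$spider = new MiniSpider();

// Without the flag the URL is dropped, because no rule matches it.
$dropped = $spider->add_url("http://www.mafengwo.cn/i/1.html");

// With $force_content = true every generated URL is queued as a content page.
for ($i = 0; $i < 3; $i++)
{
    $spider->add_url("http://www.mafengwo.cn/i/{$i}.html", array(), 0, true);
}

var_dump($dropped);               // bool(false)
echo count($spider->queue), "\n"; // 3
```

Because $force_content is the fourth parameter, callers have to pass $options and $depth explicitly, as in the loop above.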

eddy8 avatar Sep 07 '17 01:09 eddy8