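weibo-search: number of crawled posts differs

Dear author, why is it that when I set the crawl period to 2021-02-01 through 2021-03-01, I get far fewer Weibo posts than when I crawl day by day, for example 2021-02-01 to 2021-02-02?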

Nov 06 '23 06:11 JiaHongmei

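Because search results are capped at 47-50 pages. Anything beyond that limit is not returned (even when you visit s.weibo.com manually, you can only reach 47-50 pages). I suggest splitting the search period and fetching one day at a time.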
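As a rough illustration of splitting the period, here is a minimal sketch that breaks a date range into one-day windows. The helper name `daily_windows` is hypothetical and is not part of weibo-search; each window would still need to be passed to the crawler as its start and end date.

```python
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) one day at a time."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        # Clamp the last window so it never runs past the requested end date.
        yield day, min(next_day, end)
        day = next_day


# The period from the question: 2021-02-01 to 2021-03-01.
windows = list(daily_windows(date(2021, 2, 1), date(2021, 3, 1)))
for window_start, window_end in windows[:3]:
    print(window_start.isoformat(), "to", window_end.isoformat())
```

Each window is small enough that it is less likely to hit the page cap, although a very busy keyword may still exceed it within a single day.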

Nov 06 '23 08:11 cloudy-sfu

I tried crawling day by day, but after it reaches a certain point in time, it starts crawling the same comments over and over.

Nov 06 '23 08:11 JiaHongmei

#66

Nov 06 '23 08:11 cloudy-sfu

def parse_by_hour(self, response):
    """Filter by hour."""
    keyword = response.meta.get('keyword')
    is_empty = response.xpath(
        '//div[@class="card card-no-result s-pt20b40"]')
    if is_empty:
        print('No search results on the current page')
    else:
        # Parse the current page
        for weibo in self.parse_weibo(response):
            self.check_environment()
            yield weibo
        next_url = response.xpath(
            '//a[@class="next"]/@href').extract_first()
        if next_url:
            next_url = self.base_url + next_url
            yield scrapy.Request(url=next_url,
                                 callback=self.parse_page,
                                 meta={'keyword': keyword})

I made the change as described, but the publication times of the crawled posts look a bit off:

2022/3/2 10:00
2022/3/2 9:59  2022/3/2 9:59  2022/3/2 9:59  2022/3/2 9:59  2022/3/2 9:59  2022/3/2 9:59  2022/3/2 9:59
2022/3/2 8:59  2022/3/2 8:59  2022/3/2 8:59  2022/3/2 8:59  2022/3/2 8:59  2022/3/2 8:59  2022/3/2 8:59
2022/3/2 7:59  2022/3/2 7:59  2022/3/2 7:59  2022/3/2 7:59  2022/3/2 7:59  2022/3/2 7:59
2022/3/2 6:59  2022/3/2 6:59  2022/3/2 6:59  2022/3/2 6:58  2022/3/2 6:58

Nov 06 '23 08:11 JiaHongmei

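Judging by the crawl results, this looks fine. The crawl runs in parallel, so each hour is fetched separately, starting from the last minute of that hour and working backwards.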
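One way to check whether each hourly window is being paged through is to look at the earliest timestamp reached in every hour. The sketch below is only an illustration: the helper name `earliest_per_hour` is hypothetical, and the sample list is a subset of the times posted above.

```python
from datetime import datetime

# A subset of the publication times from the crawl output above.
raw_times = [
    "2022/3/2 10:00",
    "2022/3/2 9:59", "2022/3/2 9:59",
    "2022/3/2 8:59",
    "2022/3/2 7:59",
    "2022/3/2 6:59", "2022/3/2 6:58",
]


def earliest_per_hour(times):
    """Map each hour (minute set to 0) to the earliest timestamp seen in it."""
    earliest = {}
    for text in times:
        moment = datetime.strptime(text, "%Y/%m/%d %H:%M")
        hour = moment.replace(minute=0)
        if hour not in earliest or moment < earliest[hour]:
            earliest[hour] = moment
    return earliest


result = earliest_per_hour(raw_times)
for hour in sorted(result):
    print(hour.strftime("%Y-%m-%d %H:00"), "earliest:", result[hour].strftime("%H:%M"))
```

If the earliest time in an hour keeps moving back as the crawl continues, that hour is still being paged through. If it stays near minute 59 after the run finishes, the crawl for that hour may have stopped early.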

Nov 06 '23 08:11 cloudy-sfu

def parse_by_hour(self, response):
    """Filter by hour."""
    keyword = response.meta.get('keyword')
    is_empty = response.xpath(
        '//div[@class="card card-no-result s-pt20b40"]')
    if is_empty:
        print('No search results on the current page')
    else:
        # Parse the current page
        for weibo in self.parse_weibo(response):
            self.check_environment()
            yield weibo
        next_url = response.xpath(
            '//a[@class="next"]/@href').extract_first()
        if next_url:
            next_url = self.base_url + next_url
            yield scrapy.Request(url=next_url,
                                 callback=self.parse_page,
                                 meta={'keyword': keyword})

def parse_by_hour_province(self, response):

Nov 06 '23 08:11 JiaHongmei

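Several possibilities:
(1) The search returned only 1 page. Check the setting for the maximum number of returned pages and confirm it is still between 46 and 50, not 1.
(2) Narrow the range to 3 hours and wait for the run to finish. Check whether it reaches earlier times within an hour and whether a full 50 pages of results come back.
(3) Visit s.weibo.com manually to see whether you hit a CAPTCHA or another human-verification check. If so, log in again and update your cookies.

If none of the above applies, report back that you could not resolve the problem, and I will debug it when I have time.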

Nov 06 '23 09:11 cloudy-sfu