EastMoneySpider: Finally, a crawler I can actually understand... even though it has bugs

```python
        # Excerpt from the list-page callback. selector, stock_id, page, PostItem,
        # Request and LIST_URL are defined elsewhere in the spider.
        posts = selector.xpath('//div[@class="articleh normal_post"]')  # + selector.xpath('//div[@class="articleh odd"]')

        for post in posts:
            link = post.xpath('span[@class="l3 a3"]/a/@href').extract()
            if link:
                # Build an absolute URL whether or not the href starts with '/'.
                if link[0].startswith('/'):
                    link = "http://guba.eastmoney.com/" + link[0][1:]
                else:
                    link = "http://guba.eastmoney.com/" + link[0]

                # Skip posts that have already been crawled.
                if link in self._existed_urls:
                    continue
            else:
                # No href found: store None instead of an empty list.
                link = None

            # Drop pinned (set-top), ad and hinfo posts.
            # Named post_type so the built-in type() is not shadowed.
            post_type = post.xpath('span[@class="l3 a3"]/em/@class').extract()
            if post_type:
                post_type = post_type[0]
                if post_type in ('ad', 'settop', 'hinfo'):
                    continue
            else:
                post_type = 'normal'

            read_count = post.xpath('span[@class="l1 a1"]/text()').extract()
            comment_count = post.xpath('span[@class="l2 a2"]/text()').extract()
            username = post.xpath('span[@class="l4 a4"]/a/font/text()').extract()
            updated_time = post.xpath('span[@class="l5 a5"]/text()').extract()
            # Debug output for checking what the selectors matched.
            print('read_count:', read_count)
            print('comment_count:', comment_count)
            print('username:', username)
            print('updated_time:', updated_time)
            if not read_count or not comment_count or not username or not updated_time:
                # A field is missing: skip this post, keep processing the rest.
                print('skip: missing field')
                continue

            item = PostItem()
            item['stock_id'] = stock_id
            item['read_count'] = int(read_count[0])
            item['comment_count'] = int(comment_count[0])
            item['username'] = username[0].strip()
            item['updated_time'] = updated_time[0]
            item['url'] = link

            if link:
                yield Request(url=link, meta={'item': item, 'PhantomJS': True}, callback=self.parse_post)

        # Queue the next list page until the last page is reached.
        if page < self.total_pages:
            stock_id = self.stock_id
            request = Request(LIST_URL.format(stock_id=self.stock_id, page=page + 1))
            request.meta['stock_id'] = stock_id
            request.meta['page'] = page + 1
            yield request
```
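The link handling and the `int(...)` conversions in the excerpt above could be made a little more defensive. A minimal sketch, assuming hrefs on the list page may be relative, root-relative or absolute, and assuming large counts may be abbreviated with a `万` (10,000) suffix; `normalize_link` and `parse_count` are hypothetical helper names, not part of the spider:

```python
from decimal import Decimal, InvalidOperation
from urllib.parse import urljoin

BASE_URL = "http://guba.eastmoney.com/"  # same base the excerpt hard-codes


def normalize_link(href):
    """Resolve an href from the list page against BASE_URL.

    urljoin handles '/news,...', 'news,...' and absolute URLs alike,
    so no separate startswith('/') branch is needed.
    """
    return urljoin(BASE_URL, href.strip())


def parse_count(text):
    """Convert a count such as '1234' or '1.5万' to int; None if unparseable.

    The '万' suffix is an assumption about how the site abbreviates numbers.
    """
    text = text.strip()
    multiplier = 1
    if text.endswith("万"):
        multiplier = 10000
        text = text[:-1]
    try:
        # Decimal avoids float rounding surprises for values like '12.3'.
        return int(Decimal(text) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None
```

Returning `None` instead of raising lets the caller decide whether to skip the post or store it without a count.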

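The tags on the Eastmoney Guba pages have changed.
The LIST_URL you use also has a problem: as far as I can tell, only the SSE Composite Index (上证指数) uses the LIST_URL format written here. I tried the CSI 300 (沪深300) and its list URL is different, so it needs special handling.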
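One way to give some boards special handling is a small override table that is consulted before falling back to the default template. This is only a sketch: both template strings and the `000300` entry are illustrative placeholders, not the site's confirmed URL scheme, and `build_list_url` is a hypothetical helper:

```python
# Default template; a placeholder standing in for the spider's LIST_URL (assumed format).
DEFAULT_LIST_URL = "http://guba.eastmoney.com/list,{stock_id}_{page}.html"

# Overrides for boards that do not follow the default template.
# The key and the template below are hypothetical examples.
LIST_URL_OVERRIDES = {
    "000300": "http://guba.eastmoney.com/list,zs{stock_id}_{page}.html",
}


def build_list_url(stock_id, page):
    """Pick the board-specific template if one exists, else the default."""
    template = LIST_URL_OVERRIDES.get(stock_id, DEFAULT_LIST_URL)
    return template.format(stock_id=stock_id, page=page)
```

The real templates would have to be read off the site for each board type before filling in the table.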

Mar 21 '20 02:03 anmingyu11

Hi, does this have a bug when crawling individual stock boards? I tried it, but nothing gets saved to the database ...

Apr 12 '20 12:04 ZHANGM41

@ZHANGM41

It can't crawl anything as it stands. You'll have to modify it, because the Guba page structure has changed.
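A changed page structure most likely means the spider's XPath matches nothing, so no items are yielded and nothing reaches the database, with no error raised. A minimal sketch of making that visible by trying several candidate expressions in order; `select_first_match` is a hypothetical helper, and the candidate expressions would have to come from inspecting the current page:

```python
import logging

logger = logging.getLogger(__name__)


def select_first_match(xpath, expressions):
    """Return (expression, result) for the first expression that matches anything.

    `xpath` is any callable that takes an XPath string and returns a list-like
    result, for example a Scrapy selector's bound `selector.xpath` method.
    """
    for expression in expressions:
        result = xpath(expression)
        if result:
            return expression, result
    # Nothing matched: warn loudly instead of silently yielding no items.
    logger.warning("no candidate XPath matched; the page layout may have changed")
    return None, []
```

In the spider this could be called as `select_first_match(selector.xpath, [...])`, with the old `//div[@class="articleh normal_post"]` expression kept as one of the candidates.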

Apr 14 '20 01:04 anmingyu11

Thanks! After I changed it, there seems to be anti-scraping in place: after crawling for a while, I get redirected to an unrelated page... Could switching IPs solve this?
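A redirect to an unrelated page can be detected by comparing the URL that was requested with the URL that was finally served. A minimal sketch using only the standard library; `was_redirected_away` is a hypothetical helper, and treating any change of host or path as a block is an assumption about how the site behaves:

```python
from urllib.parse import urlparse

EXPECTED_HOST = "guba.eastmoney.com"


def was_redirected_away(requested_url, final_url, expected_host=EXPECTED_HOST):
    """True if the response ended up on another host or on a different path."""
    requested, final = urlparse(requested_url), urlparse(final_url)
    if final.hostname != expected_host:
        return True
    # Same host but a different page, e.g. bounced to the site root.
    return requested.path != final.path
```

In Scrapy, `response.url` is the final URL, and `response.meta.get('redirect_urls')` lists the URLs the request was redirected from, so the first entry of that list can serve as `requested_url`.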

Apr 14 '20 10:04 ZHANGM41

@ZHANGM41

You're right, there is anti-scraping. I crawled through paid IP proxies; crawling from a single machine does not work.
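Rotating through a pool of proxies can be sketched as a Scrapy downloader middleware that sets `request.meta['proxy']`, the key Scrapy's built-in `HttpProxyMiddleware` reads. This is not the setup used above, only an illustration: `RotatingProxyMiddleware` and the `PROXY_LIST` setting are hypothetical names, and the proxy URLs would come from whatever provider is used:

```python
import itertools


class RotatingProxyMiddleware:
    """Downloader-middleware sketch: assign the next proxy to every request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._proxies = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical project setting holding proxy URLs.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # HttpProxyMiddleware picks this value up later in the chain.
        request.meta["proxy"] = next(self._proxies)
        return None  # let the request continue through the middleware chain
```

The class would be registered in `DOWNLOADER_MIDDLEWARES` with a priority that runs before `HttpProxyMiddleware`. Requests rendered through a browser (the excerpt sets `'PhantomJS': True`) would likely need the proxy configured in the browser itself instead.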

Apr 15 '20 02:04 anmingyu11

OK, thank you very much!

Apr 15 '20 09:04 ZHANGM41

What is your environment setup? Could you share it? I hope we can all add each other on WeChat to discuss. I'll set up a group dedicated to crawlers.

Jul 26 '21 14:07 shizhu13

I hope we can all add each other on WeChat to discuss. I'll set up a group dedicated to crawlers. My WeChat: 876983033

Jul 26 '21 14:07 shizhu13

Did you manage to solve this problem? Is it still OK to add you on WeChat?

Jun 07 '23 17:06 c976237222

Do you still have code that works?

Jun 10 '23 18:06 c976237222
