antispider
UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 918: illegal multibyte sequence
Running the script fails with this error. Could someone help work out the cause?

### Code location
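The report does not include a traceback, so the cause cannot be confirmed. One common trigger for a `'gbk' codec can't decode` error is reading a non-GBK file on a system whose default text encoding is GBK. A minimal sketch of the usual workaround, assuming the input is UTF-8 (`read_text` is a hypothetical helper, not part of this repository):

```python
import io


def read_text(path, encoding="utf-8", errors="strict"):
    """Read a text file with an explicit encoding.

    Passing the encoding avoids falling back to the platform default,
    which is GBK on many Chinese-locale Windows installations.
    The UTF-8 default here is an assumption about the input file.
    """
    with io.open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()
```

If the file's real encoding is unknown, passing `errors="replace"` keeps the script running at the cost of substituting undecodable bytes.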

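Autohome's Koubei (owner reviews) section has updated its obfuscation logic, so the code no longer works. Could someone please update it?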
The program fails with `index out of range` when crawling SUV models on Autohome. After investigating, the cause is that `SUV` is one of the obfuscated keywords, but because it is an English keyword it is not URL-escaped. The regex therefore does not capture it, the dictionary ends up 3 entries short, and a lookup at runtime indexes past the end of the dictionary. For example:

```python
import json

import requests

# get_params is this project's deobfuscation helper.
res = requests.get("http://car.autohome.com.cn/config/spec/1646.html")
res.encoding = 'gb18030'
item = get_params(res.text)
print(json.dumps(item, ensure_ascii=False, indent=4))
```

The deobfuscated JS is shown below. `SUV`, as the first three characters, is not in `%xx` form and so is not captured.

```
SUV%E4%B8%87%E4%B8%AD%E4%BA%AC%E4%BB%B7%E4%BC%98%E4%BD%93%E4%BE%9B%E4%BF%9D%E5%85%83%E5%85%A8%E5%87%86%E5%87%91%E5%88%97%E5%88%B6%E5%89%8D%E5%8A%9B%E5%8A%9F%E5%8A%A8%E5%8A%A9%E5%8C%97%E5%8D%8E%E5%8E%8B%E5%8F%B7%E5%90%88%E5%90%8D%E5%90%8E%E5%90%B8%E5%95%86%E5%96%B7%E5%99%A8%E5%9C%B0%E5%9E%8B%E5%A4%87%E5%A4%9A%E5%A4%A7%E5%A4%AE%E5%AD%90%E5%AE%9A%E5%AE%9E%E5%AE%B9%E5%AE%BD%E5%AF%B8%E5%AF%BC%E5%B0%BA%E5%B7%AE%E5%B9%B4%E5%BA%A6%E5%BC%8F%E5%BC%B9%E5%BE%84%E5%BE%B7%E6%82%AC%E6%88%96%E6%89%AD%E6%89%BF%E6%8C%87%E6%8E%92%E6%95%B0%E6%95%B4%E6%9C%80%E6%9C%BA%E6%9D%86%E6%9E%84%E6%9E%B6%E6%A0%87%E6%A0%BC%E6%A2%B0%E6%AC%A7%E6%AF%94%E6%B0%94%E6%B2%B9%E6%B5%8B%E6%B6%B2%E7%82%B9%E7%84%B6%E7%87%83%E7%8B%AC%E7%8E%87%E7%8E%AF%E7%94%B5%E7%9B%96%E7%9B%98%E7%9F%A9%E7%A6%BB%E7%A7%AF%E7%A7%B0%E7%A8%8B%E7%A8%B3%E7%AB%8B%E7%AE%B1%E7%B0%A7%E7%B4%A7%E7%BB%BC%E7%BC%A9%E7%BC%B8%E7%BD%AE%E8%80%97%E8%83%8E%E8%87%AA%E8%93%9D%E8%A1%8C%E8%A7%84%E8%B1%AA%E8%B4%A8%E8%B7%9D%E8%BD%A6%E8%BD%AC%E8%BD%AE%E8%BD%B4%E8%BD%BD%E8%BF%9B%E8%BF%9E%E9%80%9A%E9%80%9F%E9%85%8D%E9%87%8F%E9%93%81%E9%93%9D%E9%95%BF%E9%97%A8%E9%97%B4%E9%9A%99%E9%9B%85%E9%A3%8E%E9%A9%B1%E9%A9%BB%E9%AB%98%E9%BC%93C%
```

I suspect the other English letters in this string will cause the same problem. I suggest fixing this by adjusting the regex.
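One way to avoid depending on every character being in `%xx` form is to percent-decode the whole string first and then split it into characters, rather than matching escapes with a regex. The sketch below assumes Python 3 and UTF-8 percent-encoding; `split_keyword_chars` is a hypothetical name, not a function from this repository.

```python
from urllib.parse import unquote


def split_keyword_chars(encoded):
    """Decode a partially percent-encoded string and return its characters.

    Unescaped ASCII letters such as "SUV" pass through unquote() unchanged,
    so they are counted alongside the decoded Chinese characters.
    """
    return list(unquote(encoded))


# Shortened prefix of the string from this report.
sample = "SUV%E4%B8%87%E4%B8%AD%E4%BA%AC"
chars = split_keyword_chars(sample)
print(len(chars))
```

With this approach the three letters of `SUV` each occupy a slot in the resulting list, so the index offsets line up with the obfuscated positions.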
Line 270 in `antispider/autohome.py`

```
# 获取所有变量
var_regex = "var\s+(\w+)=(.*?);\s"
```

should be:

```
# 获取所有变量
var_regex = "var\s+(\w+)\s*=\s*([\'\"].*?[\'\"]);\s*"
```

because cases like the following exist, where the quoted value itself contains a semicolon:

`var bs_=';12';`

Thanks for...
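The difference between the two patterns can be checked directly. This is a small sketch using only the standard `re` module; the variable names `OLD_VAR_REGEX` and `NEW_VAR_REGEX` are illustrative, and the second test input with spaces around `=` is an assumed extra case, not one taken from the site.

```python
import re

# Patterns copied from the report, written as raw strings.
OLD_VAR_REGEX = r"var\s+(\w+)=(.*?);\s"
NEW_VAR_REGEX = r"var\s+(\w+)\s*=\s*([\'\"].*?[\'\"]);\s*"

# Case from the report: the quoted value contains a semicolon and
# nothing follows the terminating semicolon.
snippet = "var bs_=';12';"

# The old pattern needs whitespace after the final ';', so it finds nothing.
old_matches = re.findall(OLD_VAR_REGEX, snippet)
# The new pattern anchors on the closing quote and makes the whitespace optional.
new_matches = re.findall(NEW_VAR_REGEX, snippet)

# Assumed extra case: whitespace around '=' is only accepted by the new pattern.
spaced = "var bs_ = ';12';"
old_spaced = re.findall(OLD_VAR_REGEX, spaced)
new_spaced = re.findall(NEW_VAR_REGEX, spaced)

print(old_matches, new_matches)
```

Note that the new pattern only captures quoted string values, so any unquoted assignments would need separate handling.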