qidian.com - can't scrape the numbers on the website (special fonts)

Open AlexZenghuashan opened this issue 7 years ago • 6 comments

Troubleshooting

Describe your environment

Operating system:
Python version:
Hardware:
Internet access:
Jupyter notebook or not? [Y/N]: Y
Which chapter of book?:

Describe your question

I can't scrape the number showing how many words each novel has. The URL: https://www.qidian.com/all?chanId=2&subCateId=5&size=1&action=0&orderId=&vip=0&month=3&update=1&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1

The minimum code (snippet) to reproduce the issue

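import json

import requests
from bs4 import BeautifulSoup

url = 'https://www.qidian.com/all?chanId=2&subCateId=5&size=1&action=0&orderId=&vip=0&month=3&update=1&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1'
r = requests.get(url)
# Name the parser explicitly so BeautifulSoup does not have to guess one.
mypage = BeautifulSoup(r.text, 'html.parser')
mypage

# NOTE: `a` is never defined in this snippet; it is presumably a list of
# tags selected from `mypage` in an earlier notebook cell.
json.dumps(a[45].find('span').text)
json.dumps(a[48].find('span').text)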

Nov 16 '18 13:11 AlexZenghuashan

The site uses a special font to display the numbers, so the characters in the page source are not regular digits.

We need to find a way to "decode" them back into numbers.
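
As a rough illustration of what "decoding" could look like: assuming the font maps each unusual code point to a glyph named after a digit (for example "one" or "seven"), the text can be translated character by character. The mapping `EXAMPLE_CMAP` below is invented for illustration; the real one would have to be extracted from the font file the page loads (for instance with a font library such as fontTools). The name `decode_number` is hypothetical, and this sketch is not necessarily how the notebook linked in this thread solves it.

```python
# Hypothetical mapping from obfuscated code points to glyph names.
# These values are made up; a real mapping must come from the site's font file.
EXAMPLE_CMAP = {
    0x18A11: "three",
    0x18A12: "period",
    0x18A13: "seven",
    0x18A14: "one",
}

# Glyph names as commonly used in fonts, mapped to the characters they draw.
GLYPH_TO_CHAR = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "period": ".",
}


def decode_number(text, cmap):
    """Translate each obfuscated character to the character its glyph draws."""
    decoded = []
    for ch in text:
        glyph = cmap.get(ord(ch))
        # Characters without an entry in the font map are kept unchanged.
        decoded.append(GLYPH_TO_CHAR.get(glyph, ch))
    return "".join(decoded)


# Build a sample obfuscated string from the made-up code points above.
obfuscated = "".join(chr(cp) for cp in (0x18A11, 0x18A12, 0x18A13, 0x18A14))
print(decode_number(obfuscated, EXAMPLE_CMAP))  # 3.71
```

Keeping the font mapping as a plain dictionary argument means the decoding step can be checked on its own, separately from downloading and parsing the font.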

Nov 16 '18 13:11 hupili

This is too hard for our students to solve alone. Here is a quick solution. It is best to study it together with some other students:

https://github.com/hupili/python-for-data-and-media-communication/blob/master/scraper-examples/Qidian%20wordcount.ipynb

Nov 16 '18 14:11 hupili

Thank you!

Nov 17 '18 01:11 AlexZenghuashan

Do I need to install something here?

Nov 21 '18 08:11 AlexZenghuashan

Could you tell me what special modules you used? Thank you!

Nov 21 '18 12:11 AlexZenghuashan

wget is not a Python module. It is a Linux/Unix command-line tool. You need to look up how to install it on your operating system.
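
If installing wget is inconvenient, a file can usually be downloaded from Python itself instead. Below is a minimal sketch using only the standard library. The function name `download_file` is hypothetical, and this is an alternative to the command rather than what the notebook does.

```python
import shutil
import urllib.request


def download_file(url, filename):
    """Fetch `url` and save the response body to `filename`."""
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        # Stream the body to disk instead of loading it all into memory.
        shutil.copyfileobj(response, out)
    return filename


# Example call (placeholder URL, not a real resource):
# download_file("https://example.com/some-file.bin", "some-file.bin")
```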

Nov 21 '18 13:11 hupili