python-for-data-and-media-communication-gitbook
How to scrape PDF from CNKI
Troubleshooting
Describe your environment
- Operating system: mac
- Python version: 3.7
- Hardware: macbook
- Internet access:
- Jupyter notebook or not? [Y/N]: Y
- Which chapter of book?: chapter 6
Describe your question
I want to scrape the full text of the People's Daily PDFs on CNKI but have no idea how to do it. Do I need to download all the articles first?
- Website: (one article) http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB200801010026&dbname=CCND2008&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1YXREREQva29YUjBMb0hPUG15bXpFaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!
- CNKI: http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB

Describe the efforts you have spent on this issue
I found this article about PDF scraping: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
@lullabymia, does the Medium article work? It seems to include a complete example of extracting text from PDFs.
I followed the instructions in the article, but I could not even install textract at the start.


@hupili
@lullabymia How many files do you have? Maybe you can send them to me, and I will help you run OCR with tools like filereader ocr.
We might need to scrape thousands of PDF files from this website: http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB (at least from Jan 1, 2008 to June 30, 2008). Do I need to download all the PDFs before scraping? @ChicoXYC
@lullabymia Yes, you first need to get the PDFs from the website. The following is an example of extracting words from a PDF: https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/extract-words-with-pdf.ipynb
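Since the PDFs have to be downloaded first, it may help to list the issue dates to cover before writing the downloader. The sketch below enumerates every day from January 1 to June 30, 2008 (June has 30 days) and builds a filename prefix for each. The prefix format is an assumption inferred from the two article URLs in this thread (`RMRB` + `YYYYMMDD` + a 4-digit article number); `issue_dates` and `filename_prefix` are hypothetical helper names, not part of any CNKI API.

```python
from datetime import date, timedelta


def issue_dates(start, end):
    """Yield every calendar date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def filename_prefix(day):
    # ASSUMPTION: filenames look like "RMRB" + YYYYMMDD + 4-digit article number,
    # as in RMRB200801010026 from the question. CNKI may not follow this everywhere.
    return "RMRB" + day.strftime("%Y%m%d")


dates = list(issue_dates(date(2008, 1, 1), date(2008, 6, 30)))
prefixes = [filename_prefix(d) for d in dates]
print(len(prefixes), prefixes[0], prefixes[-1])
# → 182 RMRB20080101 RMRB20080630
```

The number of articles per day is not known in advance, so the article number would still have to be discovered from each day's listing page.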
I also tried the method from the Medium article above, but I cannot get past installing the module for now. Will try later. https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/failure-extract-pdf-textract.ipynb
Don't dwell on textract. Actually, PyPDF2 alone already works, as shown in this comment.
But there is an error called
Multiple definitions in dictionary at byte 0x7eb1 for key /MediaBox
@ChicoXYC
Besides the reading error above, I cannot click the "download pdf" link on the webpage
(and it seems that when I paste the link, it downloads a CAJ file instead of a PDF)
http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!
import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)
html=browser.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
links = soup.find('div',attrs={'class':"dllink"})
link = links.find('a',attrs={'class':"icon icon-dlpdf"})
link.click()
@ChicoXYC Can you also help me with that?
@lullabymia You need to find the element with selenium itself rather than with BeautifulSoup: a tag parsed by BeautifulSoup is only a piece of HTML and cannot be clicked. The following will help:
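from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url = "http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)
# Locate the "download PDF" link through the browser so the returned element can be clicked.
# find_element(By.CSS_SELECTOR, ...) also works on newer selenium releases,
# where find_element_by_css_selector is no longer available.
link = browser.find_element(By.CSS_SELECTOR, '.dllink a.icon.icon-dlpdf')
link.click()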
@lullabymia
import os
import PyPDF2

# Folder where your PDF files are located. Putting it next to your Jupyter notebooks is suggested.
path = 'pdfs/'
for file in os.listdir(path):
    # Open each file in binary mode; the with block closes it after its pages are printed.
    with open(os.path.join(path, file), 'rb') as pdfFileObject:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
        count = pdfReader.numPages
        for i in range(count):
            page = pdfReader.getPage(i)
            print(page.extractText())
Clicking the PDF link worked, thank you! But the error "Multiple definitions in dictionary at byte 0xcd34 for key /MediaBox" appeared again while reading the PDF files. :(
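One thing that may be worth trying for the "Multiple definitions in dictionary" message: in the PyPDF2 1.x releases used in this thread, `PdfFileReader` accepts a `strict` argument, and with `strict=False` this kind of malformed-dictionary problem is reported as a warning instead of raising `PdfReadError`. The sketch below assumes PyPDF2 1.x (newer 3.x releases renamed these classes and methods), and `extract_text` is a hypothetical helper name. Whether the extracted text is usable still depends on the PDF itself.

```python
def extract_text(pdf_path):
    """Return the text of every page in one PDF as a list of strings."""
    # Imported inside the function so it can be defined before PyPDF2 is installed.
    # ASSUMPTION: PyPDF2 1.x, where PdfFileReader, numPages, getPage and extractText exist.
    import PyPDF2

    with open(pdf_path, 'rb') as pdf_file:
        # strict=False asks PyPDF2 to warn about malformed objects rather than stop with an error.
        reader = PyPDF2.PdfFileReader(pdf_file, strict=False)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]
```

Calling `extract_text('pdfs/some-article.pdf')` (a placeholder path) would return one string per page. Scanned pages that contain only images will come back empty and would need OCR instead.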
@ChicoXYC
@lullabymia
- Check whether only PDF files are in the folder.
- Please don't leave blank spaces in the folder name.
- If it doesn't work, you can find me in CVA 808 to work it out face to face.
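The first check in the list above can be done in code rather than by eye. A PDF normally begins with the bytes `%PDF-`, so the sketch below reads the first five bytes of each file and separates likely PDFs from everything else (for example a hidden `.DS_Store` on a Mac, or a download that is not really a PDF). `split_pdfs` is a hypothetical helper name, and the `pdfs/` folder is the one used earlier in this thread.

```python
import os


def split_pdfs(folder):
    """Return (pdf_files, other_files) for the regular files in folder."""
    pdf_files, other_files = [], []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if not os.path.isfile(full_path):
            continue  # skip sub-folders such as .ipynb_checkpoints
        with open(full_path, 'rb') as handle:
            header = handle.read(5)
        # Files that do not start with "%PDF-" are set aside instead of being passed to PyPDF2.
        if header == b'%PDF-':
            pdf_files.append(name)
        else:
            other_files.append(name)
    return pdf_files, other_files


if os.path.isdir('pdfs/'):
    pdfs, others = split_pdfs('pdfs/')
    print('PDF files:', pdfs)
    print('Other files to remove or skip:', others)
```

Looping over only the first list keeps unexpected files away from the reader, which makes it easier to tell whether a remaining error comes from a PDF itself.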
redirect to #135