Troubleshooting

Describe your environment

Operating system: mac
Python version: 3.7
Hardware: macbook
Internet access:
Jupyter notebook or not? [Y/N]: Y
Which chapter of book?: chapter 6

Describe your question

I want to scrape the full PDF text of People's Daily articles from CNKI, but I have no idea how to do it. Do I need to download all the articles first?

Website: (one article) http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB200801010026&dbname=CCND2008&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1YXREREQva29YUjBMb0hPUG15bXpFaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!
CNKI: http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB

Describe the efforts you have spent on this issue

I found this article about PDF scraping: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

Nov 30 '18 09:11 lullabymia

@lullabymia, does the Medium article work? It seems to include a complete example of extracting text from PDFs.

Nov 30 '18 15:11 hupili

I followed the instructions in the article, but I could not install textract at the very first step.

@hupili

Dec 01 '18 08:12 lullabymia

@lullabymia How many files do you have? Maybe you can send them to me, and I will help you do OCR with tools like filereader ocr.

Dec 01 '18 09:12 ChicoXYC

We might need to scrape thousands of PDF files from this website: http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB (at least from Jan 1, 2008 to June 30, 2008). Do I need to download all the PDFs before scraping? @ChicoXYC

Dec 01 '18 12:12 lullabymia

@lullabymia Yes, you first need to get the PDFs from the website. Here is an example of extracting words from a PDF: https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/extract-words-with-pdf.ipynb
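
Before downloading thousands of files, it may help to check which article links fall inside the period you need. The sketch below uses only the standard library. It assumes, based solely on the two detail URLs posted in this thread, that the `filename` parameter has the form `RMRB` + `YYYYMMDD` + a 4-digit sequence number; CNKI does not document this here, so treat it as a guess. The names `article_info` and `in_range` are made up for illustration.

```python
from datetime import date, datetime
from urllib.parse import urlparse, parse_qs


def article_info(url):
    """Return (filename, issue_date) parsed from a CNKI detail URL."""
    query = parse_qs(urlparse(url).query)
    filename = query["filename"][0]  # e.g. "RMRB200801010026"
    # Assumption: characters 4..11 hold the issue date as YYYYMMDD.
    issue_date = datetime.strptime(filename[4:12], "%Y%m%d").date()
    return filename, issue_date


def in_range(url, start=date(2008, 1, 1), end=date(2008, 6, 30)):
    """True if the article's issue date lies within [start, end]."""
    return start <= article_info(url)[1] <= end
```

Calling `in_range(url)` on each collected link would let you skip articles outside Jan 1 to June 30, 2008 before opening them in the browser.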

Dec 01 '18 16:12 ChicoXYC

I also tried the method from the Medium article above, but I cannot get any further for now because the module fails to install. Will try again later. https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/failure-extract-pdf-textract.ipynb

Dec 01 '18 16:12 ChicoXYC

Don't dwell on textract. Actually PyPDF2 alone already works as shown in this comment

Dec 03 '18 15:12 hupili

But there is an error: "Multiple definitions in dictionary at byte 0x7eb1 for key /MediaBox". @ChicoXYC

Dec 10 '18 07:12 lullabymia

Besides the reading error above, I cannot click the "download pdf" link on the web page (and when I paste the link directly, it seems to download a CAJ file instead of a PDF): http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!

import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)
html=browser.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
links = soup.find('div',attrs={'class':"dllink"})
link = links.find('a',attrs={'class':"icon icon-dlpdf"})
link.click()

@ChicoXYC Can you also help me with that ?

Dec 12 '18 05:12 lullabymia

@lullabymia You need to find the element with selenium itself rather than with BeautifulSoup: a BeautifulSoup tag is only parsed HTML and cannot be clicked. The following will help:

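from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)

# Locate the "download pdf" link through the browser so the result is a clickable WebElement.
# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector, which newer selenium releases removed.
link = browser.find_element(By.CSS_SELECTOR, '.dllink a.icon.icon-dlpdf')
link.click()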

Dec 12 '18 10:12 ChicoXYC

@lullabymia

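import os
import PyPDF2

path = 'pdfs/'   # folder that holds your PDF files; keeping it next to your Jupyter notebooks is simplest
for file in os.listdir(path):
    if not file.lower().endswith('.pdf'):
        continue   # skip anything that is not a PDF, e.g. .caj downloads or hidden system files
    # Build the file path from `path` instead of a hard-coded 'pdfs/', and close the file when done
    with open(os.path.join(path, file), 'rb') as pdfFileObject:
        # PyPDF2 3.x renamed these calls to PdfReader, len(reader.pages), reader.pages[i] and extract_text()
        pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
        count = pdfReader.numPages
        for i in range(count):
            page = pdfReader.getPage(i)
            print(page.extractText())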

Dec 12 '18 10:12 ChicoXYC

Clicking the PDF link worked, thank you! But the error "Multiple definitions in dictionary at byte 0xcd34 for key /MediaBox" appeared again while reading the PDF files. :(

@ChicoXYC

Dec 12 '18 11:12 lullabymia

@lullabymia

Check whether the folder contains only PDF files.
Please don't leave spaces in the folder name.
If it still doesn't work, you can find me in cva 808 and we can work it out face to face.
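
A quick way to run the first check is sketched below, using only the standard library. It relies on the fact that a PDF file begins with the bytes `%PDF-`, so it judges each file by its content rather than by its extension. The function name `find_pdfs` is made up for illustration, and whether stray files are actually what triggers the /MediaBox error has not been confirmed in this thread.

```python
import os


def find_pdfs(folder):
    """Split the files in `folder` into (pdfs, others) by their first bytes."""
    pdfs, others = [], []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # ignore sub-folders
        with open(path, "rb") as f:
            header = f.read(5)
        # Real PDF files start with the signature "%PDF-"
        if header == b"%PDF-":
            pdfs.append(name)
        else:
            others.append(name)
    return pdfs, others
```

Running `find_pdfs('pdfs/')` and looking at the second list should reveal anything unexpected in the folder, such as a CAJ download saved under a `.pdf` name or a hidden `.DS_Store` file on a Mac.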

Dec 12 '18 12:12 ChicoXYC

redirect to #135

Dec 14 '18 16:12 hupili

python-for-data-and-media-communication-gitbook

How to scrape PDF from CNKI