
How to scrape PDF from CNKI

Open lullabymia opened this issue 6 years ago • 14 comments

Troubleshooting

Describe your environment

  • Operating system: mac
  • Python version: 3.7
  • Hardware: macbook
  • Internet access:
  • Jupyter notebook or not? [Y/N]: Y
  • Which chapter of book?: chapter 6

Describe your question

I want to scrape the full PDF text of People's Daily articles from CNKI but have no idea how to do it. Do I need to download all the articles first?

  • Website: (one article) http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB200801010026&dbname=CCND2008&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1YXREREQva29YUjBMb0hPUG15bXpFaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!
  • CNKI: http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB

Describe the efforts you have spent on this issue

I found this article about PDF scraping: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

lullabymia avatar Nov 30 '18 09:11 lullabymia

@lullabymia, does the Medium article work? It seems to include a complete example of extracting text from PDFs.

hupili avatar Nov 30 '18 15:11 hupili

I followed the instructions in the article, but I could not install textract, which is the very first step.

@hupili

lullabymia avatar Dec 01 '18 08:12 lullabymia

@lullabymia How many files do you have? Maybe you can send them to me, and I will help you run OCR with a tool like filereader ocr.

ChicoXYC avatar Dec 01 '18 09:12 ChicoXYC

We might need to scrape thousands of PDF files from this website: http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB (at least from Jan 1, 2008 to June 30, 2008). Do I need to download all the PDFs before scraping? @ChicoXYC

lullabymia avatar Dec 01 '18 12:12 lullabymia

@lullabymia Yes, you first need to get the PDFs from the website. Here is an example of extracting words from a PDF: https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/extract-words-with-pdf.ipynb

ChicoXYC avatar Dec 01 '18 16:12 ChicoXYC

I also tried the method from the Medium article, but for now I am stuck at installing the module. I will try again later. https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/failure-extract-pdf-textract.ipynb

ChicoXYC avatar Dec 01 '18 16:12 ChicoXYC

Don't dwell on textract. Actually PyPDF2 alone already works as shown in this comment

hupili avatar Dec 03 '18 15:12 hupili

But there is an error: "Multiple definitions in dictionary at byte 0x7eb1 for key /MediaBox". @ChicoXYC

lullabymia avatar Dec 10 '18 07:12 lullabymia

Besides the reading error above, I cannot click the "download pdf" link on the webpage (and when I paste the link into the browser, it seems to download a CAJ file instead of a PDF): http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!

import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)
html=browser.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
links = soup.find('div',attrs={'class':"dllink"})
link = links.find('a',attrs={'class':"icon icon-dlpdf"})
link.click()

@ChicoXYC Can you also help me with that ?

lullabymia avatar Dec 12 '18 05:12 lullabymia

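@lullabymia You need to find the element with Selenium itself rather than in the parsed HTML: a BeautifulSoup tag is only a piece of the page source and cannot be clicked, while a Selenium element can. The following will help: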

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)

# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector, which Selenium 4.3+ removed
link = browser.find_element(By.CSS_SELECTOR, '.dllink a.icon.icon-dlpdf')
link.click()
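
For the thousands of articles mentioned earlier in this thread, the same click could be wrapped in a loop. What follows is only a minimal sketch, not a tested CNKI scraper. It assumes the `.dllink a.icon.icon-dlpdf` selector above is still valid and that the browser session is allowed to download. The helper names `article_id` and `click_pdf_links` are made up for illustration. Instead of clicking immediately, it waits until the link is clickable and records which articles failed.

```python
from urllib.parse import urlparse, parse_qs


def article_id(url):
    """Return the `filename` query parameter of a CNKI detail URL, or None."""
    values = parse_qs(urlparse(url).query).get("filename")
    return values[0] if values else None


def click_pdf_links(urls, timeout=10):
    """Open each detail page and click its PDF link (hypothetical helper)."""
    # Selenium is imported here so that article_id() works without it installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    clicked, failed = [], []
    browser = webdriver.Chrome()
    try:
        for url in urls:
            browser.get(url)
            try:
                # Wait up to `timeout` seconds for the link to become clickable.
                link = WebDriverWait(browser, timeout).until(
                    EC.element_to_be_clickable(
                        (By.CSS_SELECTOR, ".dllink a.icon.icon-dlpdf")
                    )
                )
                link.click()
                clicked.append(article_id(url))
            except TimeoutException:
                # Link never appeared (page layout changed, login needed, ...).
                failed.append(article_id(url))
    finally:
        browser.quit()
    return clicked, failed
```

`article_id` gives a stable name (for example `RMRB201812110011`) for logging or for renaming the downloaded file. How the list of detail-page URLs is collected is left open here.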

ChicoXYC avatar Dec 12 '18 10:12 ChicoXYC

@lullabymia

import os
import PyPDF2

path = 'pdfs/'  # folder holding your PDF files; keeping it next to your Jupyter notebooks is easiest
for file in os.listdir(path):
    if not file.lower().endswith('.pdf'):
        continue  # skip anything that is not a PDF, e.g. .DS_Store on a Mac
    # PyPDF2 3.x names; older releases used PdfFileReader, numPages, getPage(i) and extractText()
    with open(os.path.join(path, file), 'rb') as pdfFileObject:
        pdfReader = PyPDF2.PdfReader(pdfFileObject)
        for page in pdfReader.pages:
            print(page.extract_text())

ChicoXYC avatar Dec 12 '18 10:12 ChicoXYC

Clicking the PDF link worked, thank you! But the error "Multiple definitions in dictionary at byte 0xcd34 for key /MediaBox" appeared again while reading the PDF files. :(

@ChicoXYC

lullabymia avatar Dec 12 '18 11:12 lullabymia

@lullabymia

  1. Check whether the folder contains only PDF files.
  2. Please don't leave blank spaces in the folder name.
  3. If it still doesn't work, you can find me in CVA 808 and we can work it out face to face.
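
The first check above can be scripted, and the /MediaBox message can usually be downgraded from an error to a warning. The following is a minimal sketch under several assumptions. The names `looks_like_pdf`, `split_folder` and `read_folder` are illustrative. The check relies on PDF files starting with the bytes `%PDF-`, so a CAJ download saved with a `.pdf` name would fail it. The code assumes PyPDF2 3.x, where the reader class is `PdfReader`. To the best of my knowledge, passing `strict=False` makes PyPDF2 warn about "Multiple definitions in dictionary" instead of raising.

```python
import os


def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def split_folder(folder):
    """Split the files in `folder` into (pdf_paths, other_paths), sorted by name."""
    pdfs, others = [], []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # ignore sub-folders
        (pdfs if looks_like_pdf(path) else others).append(path)
    return pdfs, others


def read_folder(folder):
    """Extract text from every real PDF in `folder`; skip files that fail to parse."""
    import PyPDF2  # imported lazily; assumes PyPDF2 >= 3.0 (PdfReader API)

    texts = {}
    pdfs, others = split_folder(folder)
    for path in others:
        print("not a PDF, skipped:", path)
    for path in pdfs:
        try:
            # strict=False: tolerate malformed dictionaries such as a duplicate /MediaBox
            reader = PyPDF2.PdfReader(path, strict=False)
            texts[path] = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
        except Exception as exc:  # broad on purpose: report and continue the batch
            print("could not read", path, "->", exc)
    return texts
```

Calling `split_folder('pdfs')` first shows which files are not real PDFs before any parsing is attempted. `read_folder('pdfs')` then returns a dictionary mapping each readable file to its text.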

ChicoXYC avatar Dec 12 '18 12:12 ChicoXYC

redirect to #135

hupili avatar Dec 14 '18 16:12 hupili