Extract document details from PDF

Open georgerichardson opened this issue 8 years ago • 10 comments

Can we extract items such as the title and date published from a pdf?

georgerichardson avatar Feb 08 '17 01:02 georgerichardson

Hmm, I think it's possible. It might involve going to PyPDF2 and/or using regex after parsing. Then again, it all depends on what the pdf file has to give. I'm not sure if I'll have time to dig into this myself.
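
As a rough idea of the regex route, here is a minimal sketch that looks for a date written like "8 February 2017" in text that has already been extracted from the PDF. The function name find_first_date and the single date layout it handles are assumptions for illustration; real documents will use other formats too.

```python
import re
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Matches dates such as "8 February 2017" (day, full English month name, year).
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")


def find_first_date(text):
    """Return the first 'D Month YYYY' date found in text, or None.

    Hypothetical helper: it only knows one date layout.
    """
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    try:
        return date(int(year), MONTHS.index(month) + 1, int(day))
    except ValueError:
        # The pattern matched, but it is not a real calendar date.
        return None
```

The first date in a document is not necessarily the publication date, so this would only ever give a candidate value.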

coldfashioned avatar Feb 08 '17 16:02 coldfashioned

Title looks hard. For published date, what if we use the Last-Modified field from the response headers?

We could put this in the get_pdf function at the same time as making sure the page exists:

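from urllib import request, error
try:
    response = request.urlopen(url) # not sure if this is needed?
    pub_date = response.getheader('Last-Modified') # None if the header is missing
except error.URLError:
    pub_date = None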

However, this value would then have to be passed back to get_body_text and from there back to pdf_article in order to be saved along with the other article data.

Supposedly HTTP header dates must all follow a standard format, so they should be pretty simple to parse.
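
For the parsing step, something like the sketch below might be enough. It relies on parsedate_to_datetime from the standard library's email.utils, which understands the date format used in HTTP headers. The helper name parse_last_modified is an assumption, not something that exists in the scraper.

```python
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value):
    """Convert a Last-Modified header value to a datetime, or None.

    Hypothetical helper: returns None when the header is missing or malformed.
    """
    if not header_value:
        return None
    try:
        return parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on unparseable input,
        # newer ones raise ValueError.
        return None


if __name__ == "__main__":
    print(parse_last_modified("Wed, 08 Feb 2017 01:02:00 GMT"))
```

Bear in mind that Last-Modified says when the file on the server last changed, which is only a stand-in for the real publication date.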

simonb83 avatar Feb 09 '17 00:02 simonb83

Is someone working on it?

Guilhermeslucas avatar Feb 11 '17 00:02 Guilhermeslucas

Hey @Guilhermeslucas, we have implemented a solution for extracting the published date, but haven't figured out how to get the title yet, if you're interested in looking into it :-)

simonb83 avatar Feb 11 '17 00:02 simonb83

I'm working on #4 right now, because it's more beginner-friendly. I'll let you know when I finish that issue. Thanks!

Guilhermeslucas avatar Feb 13 '17 16:02 Guilhermeslucas

Hey I just saw this issue about extracting details from a PDF. Since we have the updated schema, I'm just wondering if we should pick this up again. I came across PyPDF2. I will try it out for a bit.

domingohui avatar Mar 12 '17 22:03 domingohui

Sounds good. Yes, based on the new pipeline (see process_url in PR_107#pipeline) it would be good if scraper.scrape() returns as much detail as possible for pdfs.

simonb83 avatar Mar 15 '17 03:03 simonb83

Looks like scraper is already using textract, which uses pdfminer, which is similar to PyPDF2. But they both seem to have difficulties extracting things like titles. Content-wise (text), there's no problem. I will keep trying things out.
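
Since the text itself comes out fine, one crude fallback would be to guess the title from it. The sketch below just takes the first non-empty, reasonably short line of the extracted text. Both guess_title and the max_length cut-off are assumptions for illustration, and the heuristic will be wrong whenever a PDF starts with a running header, a logo caption or a cover note.

```python
def guess_title(text, max_length=200):
    """Guess a title from extracted PDF text.

    Hypothetical heuristic: the first non-empty line that is not longer
    than max_length characters. Returns None if no line qualifies.
    """
    for line in text.splitlines():
        candidate = " ".join(line.split())  # collapse runs of whitespace
        if candidate and len(candidate) <= max_length:
            return candidate
    return None


if __name__ == "__main__":
    sample = "\n\n  Global Report   2017 \nBody text starts here."
    print(guess_title(sample))
```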

domingohui avatar Mar 15 '17 03:03 domingohui

@domingohui Have you had any luck with different approaches on this one?

georgerichardson avatar May 04 '17 18:05 georgerichardson

I haven't found a tool that can handle this reasonably well...

domingohui avatar May 05 '17 00:05 domingohui