internal-displacement Extract document details from PDF

Can we extract items such as the title and date published from a pdf?

Feb 08 '17 01:02 georgerichardson

Hmm, I think it's possible. It might involve going to PyPDF2 and/or using regex after parsing. Then again, it all depends on what the pdf file has to give. I'm not sure if I'll have time to dig into this myself.
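To make the "regex after parsing" idea concrete, here is a minimal sketch that works on text already extracted from the PDF. The helper names `guess_title` and `guess_date` are hypothetical. The first-non-blank-line heuristic for the title is only an assumption and will fail on many documents, and the date pattern handles just one style ("8 February 2017").

```python
import re
from typing import Optional

# Matches dates written like "8 February 2017" (one common style only).
DATE_PATTERN = re.compile(
    r"\b\d{1,2}\s+"
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)"
    r"\s+\d{4}\b",
    re.IGNORECASE,
)


def guess_title(text: str) -> Optional[str]:
    """Return the first non-blank line as a rough title guess (heuristic)."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return None


def guess_date(text: str) -> Optional[str]:
    """Return the first date-like string found in the text, or None."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None
```

Both functions return None when nothing usable is found, so the caller can fall back to another source for the value.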

Feb 08 '17 16:02 coldfashioned

Title looks hard. For published date, what if we use the Last-Modified field from the response headers?

We could put this in the get_pdf function at the same time as making sure the page exists:

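from urllib import request, error
try:
    response = request.urlopen(url) # not sure if this is needed?
    pub_date = response.getheader('Last-Modified') # None if the header is missing
except error.URLError:
    pub_date = None # page could not be fetched, so there is no date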

However, this value would then have to be passed back to get_body_text and from there back to pdf_article in order to be saved along with the other article data.

Supposedly HTTP header dates must all follow a standard format, so they should be pretty simple to parse.
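As a rough sketch of that parsing step, the standard library's `email.utils.parsedate_to_datetime` understands the date format used in HTTP headers. The wrapper name `parse_last_modified` is hypothetical, and treating a missing or malformed header as None is an assumption about what the caller wants.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_last_modified(header_value: Optional[str]) -> Optional[datetime]:
    """Convert a Last-Modified header value to a datetime, or None if unusable."""
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # HTTP dates are defined in GMT, so treat a naive result as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

For example, `parse_last_modified(response.getheader('Last-Modified'))` would give a timezone-aware datetime when the server sends the header, and None otherwise.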

Feb 09 '17 00:02 simonb83

Is someone working on it?

Feb 11 '17 00:02 Guilhermeslucas

Hey @Guilhermeslucas, we have implemented a solution for extracting the published date, but haven't figured out how to get the title yet, if you're interested in looking into it :-)

Feb 11 '17 00:02 simonb83

I'm working right now on #4, because it's more beginner-friendly. I'll let you know when I finish that issue. Thanks!

Feb 13 '17 16:02 Guilhermeslucas

Hey I just saw this issue about extracting details from a PDF. Since we have the updated schema, I'm just wondering if we should pick this up again. I came across PyPDF2. I will try it out for a bit.

Mar 12 '17 22:03 domingohui

Sounds good. Yes, based on the new pipeline (see process_url in PR_107#pipeline) it would be good if scraper.scrape() returns as much detail as possible for pdfs.

Mar 15 '17 03:03 simonb83

Looks like scraper is already using textract, which uses pdfminer, which is similar to PyPDF2. But they both seem to have difficulty extracting things like titles. Content-wise (text), there's no problem. I will keep trying things out.

Mar 15 '17 03:03 domingohui

@domingohui Have you had any luck with different approaches on this one?

May 04 '17 18:05 georgerichardson

I haven't found a tool that can handle this reasonably well...

May 05 '17 00:05 domingohui
