internal-displacement
Extract document details from PDF
Can we extract items such as the title and date published from a pdf?
Hmm, I think it's possible. It might involve going to PyPDF2 and/or using regex after parsing. Then again, it all depends on what the pdf file has to give. I'm not sure if I'll have time to dig into this myself.
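As a rough sketch of the regex idea, here is one thing that could be tried without any extra library: scanning the raw bytes for the `/Title` and `/CreationDate` entries that some PDFs carry in their document information dictionary. This is an assumption about what a given file contains, not something confirmed for the PDFs in this project. It only works when those entries are stored uncompressed as simple literal strings, and many files leave them empty or encode them differently. The name `read_info_fields` is hypothetical.

```python
import re

# Matches entries such as "/Title (Some text)" in uncompressed PDF bytes.
# Values containing nested parentheses or backslash escapes are deliberately skipped.
INFO_FIELD = re.compile(rb'/(Title|CreationDate)\s*\(([^()\\]*)\)')


def read_info_fields(pdf_bytes):
    """Return the first /Title and /CreationDate literal strings found, if any."""
    fields = {}
    for key, value in INFO_FIELD.findall(pdf_bytes):
        # Keep only the first occurrence of each key.
        fields.setdefault(key.decode('ascii'), value.decode('latin-1'))
    return fields


# Hand-written fragment that imitates an information dictionary (not a real PDF).
sample = b'1 0 obj\n<< /Title (Annual Report) /CreationDate (D:20160315120000Z) >>\nendobj\n'
print(read_info_fields(sample))
```

If neither key is present the function returns an empty dict, so the caller can fall back to something else.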
Title looks hard. For published date, what if we use the Last-Modified field from the response headers?
We could put this in the get_pdf function at the same time as making sure the page exists:
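from urllib import request, error
try:
    response = request.urlopen(url)  # not sure if this is needed?
    pub_date = response.getheader('Last-Modified')
except error.URLError:
    pub_date = None  # page missing or unreachable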
However, this value would then have to be passed back to get_body_text and from there back to pdf_article in order to be saved along with the other article data.
Supposedly, HTTP header dates must all follow a standard format, so they should be pretty simple to parse.
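For what it's worth, a minimal sketch of the parsing step, assuming the header value arrives as a string (or None when the server doesn't send it). It uses `email.utils.parsedate_to_datetime` from the standard library, which understands the usual HTTP date format. The helper name `parse_last_modified` is hypothetical and not part of the existing scraper.

```python
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value):
    """Return a datetime for an HTTP date header, or None if it is missing or invalid."""
    if not header_value:
        return None
    try:
        return parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on unparseable input, newer ones ValueError.
        return None


pub_date = parse_last_modified('Wed, 21 Oct 2015 07:28:00 GMT')
print(pub_date.isoformat())
```

Returning None instead of raising means a missing or malformed header doesn't have to break the rest of the scrape.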
Someone working on it?
Hey @Guilhermeslucas, we have implemented a solution for extracting the published date, but haven't figured out how to get the title yet, if you're interested in looking into it :-)
I'm working right now on #4, because it's more beginner friendly. I'll let you know when I finish that issue. Thanks!
Hey I just saw this issue about extracting details from a PDF. Since we have the updated schema, I'm just wondering if we should pick this up again. I came across PyPDF2. I will try it out for a bit.
Sounds good. Yes, based on the new pipeline (see process_url in PR_107#pipeline) it would be good if scraper.scrape() returns as much detail as possible for pdfs.
Looks like scraper is already using textract, which uses pdfminer, which is similar to PyPDF2. But they both seem to have difficulties extracting things like titles. Content (text) wise, there's no problem. I will keep trying things out.
@domingohui Have you had any luck with different approaches on this one?
I haven't found a tool that can handle this reasonably well...