Simon Bedford comments


Generate a reliability score for a given article

I'm not really sure yet, but likely some sort of combination. Probably initially some hard coded rules to generate a preliminary score that can then be verified by an analyst...
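As a rough illustration of the "hard coded rules first, analyst verification after" idea, a minimal sketch might look like the following. Everything in it is an assumption made for illustration: the `Article` fields, the `TRUSTED_DOMAINS` set, and the point values are hypothetical and do not come from the project.

```python
from dataclasses import dataclass

# Hypothetical allow-list; a real one would be curated by analysts.
TRUSTED_DOMAINS = {"example.org"}


@dataclass
class Article:
    """Hypothetical minimal article record used only for this sketch."""
    domain: str
    has_author: bool
    has_publish_date: bool


def preliminary_reliability(article: Article) -> int:
    """Return a preliminary 0-100 score from a few hard-coded rules.

    The score is only a starting point and is meant to be reviewed
    (and overridden if needed) by an analyst.
    """
    score = 50  # neutral starting point (assumed)
    if article.domain in TRUSTED_DOMAINS:
        score += 30
    if article.has_author:
        score += 10
    if article.has_publish_date:
        score += 10
    # Clamp so that future rules cannot push the score out of range.
    return max(0, min(100, score))
```

Integer points are used rather than floats so that the rule weights add up exactly.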

Scraper - Tag content type

For video / image, if there is no accompanying text on the page, we are likely to end up tagging the link as not relevant as the idea is to...

Deal with datetime issue

Yeah, agreed: `None` makes more sense now that I think about it.

Python process to check for new URLs and run the pipeline on them

I love this idea. I also think it would be a good idea to split the Pipeline into two separate parts: URL Parsing & Classification / Report Extraction. Perhaps we...

Extract document details from PDF

Title looks hard. For published date, what if we use the `Last-Modified` field from the response headers? We could put this in the `get_pdf` function at the same time as...
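A minimal sketch of the `Last-Modified` idea, assuming the response headers are available as a mapping (for example the `headers` attribute of an HTTP response object). The helper name `published_date_from_headers` is hypothetical, and how it would be wired into `get_pdf` is not shown here.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def published_date_from_headers(headers: Mapping[str, str]) -> Optional[datetime]:
    """Parse the Last-Modified header into a datetime, or return None.

    Note: a plain dict is case-sensitive, so this sketch looks up the
    canonical "Last-Modified" spelling only.
    """
    raw = headers.get("Last-Modified")
    if not raw:
        return None
    try:
        # HTTP dates use the same format as email dates (RFC 5322 style).
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on unparseable input,
        # newer ones raise ValueError.
        return None
```

Keep in mind that `Last-Modified` reflects when the server last changed the file, so it is only a proxy for the published date.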

Extract document details from PDF

Hey @Guilhermeslucas, we have implemented a solution for extracting the published date, but haven't figured out how to get the title yet, if you're interested in looking into it :-)

Extract document details from PDF

Sounds good. Yes, based on the new pipeline (see `process_url` in [PR_107#pipeline](https://github.com/Data4Democracy/internal-displacement/pull/107/files#diff-8a311ad97f84ada36b61b16574d51398)) it would be good if `scraper.scrape()` returns as much detail as possible for PDFs.

Deal with timeouts when scraping

Yeah, I got this too. For the purposes of the submission, I just caught the error and moved on to the next URL, but I wonder if it makes sense...
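The "catch the error and move on" approach could be sketched roughly as below. The `scrape_all` helper and the `fetch` callable are hypothetical stand-ins, and the sketch assumes `fetch` raises the built-in `TimeoutError`. With the `requests` library, the exception to catch would be `requests.exceptions.Timeout` instead.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch each URL, skipping any that time out.

    Returns the successful results plus the list of URLs that timed
    out, so they can be retried or reported later.
    """
    results: Dict[str, str] = {}
    timed_out: List[str] = []
    for url in urls:
        try:
            results[url] = fetch(url)
        except TimeoutError:
            # Record the failure and continue with the next URL.
            timed_out.append(url)
    return results, timed_out
```

Keeping the timed-out URLs in a separate list leaves room for a retry pass rather than dropping them silently.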