Simon Bedford

8 comments by Simon Bedford

I'm not really sure yet, but likely some sort of combination. Probably initially some hard-coded rules to generate a preliminary score that can then be verified by an analyst...

For video / image, if there is no accompanying text on the page, we are likely to end up tagging the link as not relevant as the idea is to...

Yeah, agreed. None makes more sense now that I think about it.

I love this idea. I also think it would be a good idea to split the Pipeline into two separate parts: URL Parsing & Classification / Report Extraction. Perhaps we...

Title looks hard. For published date, what if we use the `Last-Modified` field from the response headers? We could put this in the `get_pdf` function at the same time as...
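A rough sketch of the `Last-Modified` idea, in case it helps. It assumes the response headers are available as a mapping (as they are on a `requests` response); the helper name `published_date_from_headers` is made up for illustration and is not part of the existing code. Note that `Last-Modified` only reflects when the server last changed the file, so it is an approximation of the published date at best.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def published_date_from_headers(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if absent or malformed.

    Hypothetical helper: a plain dict is case-sensitive, so the key must be
    spelled exactly "Last-Modified" unless a case-insensitive mapping is passed.
    """
    raw = headers.get("Last-Modified")
    if raw is None:
        return None
    try:
        # HTTP dates use the RFC 5322 style, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on unparseable input, newer ones ValueError
        return None
```

Something like this could be called inside `get_pdf` right after the download, passing in the headers of the response that fetched the file.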

Hey @Guilhermeslucas, we have implemented a solution for extracting the published date, but haven't figured out how to get the title yet, if you're interested in looking into it :-)

Sounds good. Yes, based on the new pipeline (see `process_url` in [PR_107#pipeline](https://github.com/Data4Democracy/internal-displacement/pull/107/files#diff-8a311ad97f84ada36b61b16574d51398)) it would be good if `scraper.scrape()` returns as much detail as possible for PDFs.

Yeah, I got this too. For the purposes of the submission I just caught the error and moved on to the next URL, but I wonder if it makes sense...
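For reference, the catch-and-move-on approach looks roughly like the sketch below. The names `process_all` and `process` are placeholders rather than functions from the project, and catching a bare `Exception` is only reasonable as a stopgap; keeping a list of failures at least makes it possible to revisit the skipped URLs later.

```python
from typing import Any, Callable, Iterable, List, Tuple


def process_all(
    urls: Iterable[str], process: Callable[[str], Any]
) -> Tuple[List[Tuple[str, Any]], List[Tuple[str, str]]]:
    """Run `process` on each URL, skipping the ones that raise.

    `process` is a stand-in for whatever per-URL step is failing.
    Returns (results, failures), where failures pairs each URL with its error.
    """
    results: List[Tuple[str, Any]] = []
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            results.append((url, process(url)))
        except Exception as exc:  # broad on purpose: record the error and keep going
            failures.append((url, repr(exc)))
    return results, failures
```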