internal-displacement
Studying news events and internal displacement.
Can we extract items such as the title and date published from a PDF?
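Pulling text out of a PDF needs a library (e.g. pdfminer or PyPDF2, not shown here); once we have the text, the publication date can often be recovered with a regex. A minimal sketch, where the date formats, patterns, and function name are all assumptions:

```python
import re
from datetime import datetime

MONTHS = (r'January|February|March|April|May|June|July|'
          r'August|September|October|November|December')

# Two common date formats as a starting point; real PDFs will need more.
DATE_PATTERNS = [
    (r'\b(\d{1,2} (?:%s) \d{4})\b' % MONTHS, '%d %B %Y'),   # 12 March 2017
    (r'\b((?:%s) \d{1,2}, \d{4})\b' % MONTHS, '%B %d, %Y'),  # March 12, 2017
]

def extract_date_published(text):
    """Return the first recognisable date in text (already extracted
    from the PDF), or None if nothing matches."""
    for pattern, fmt in DATE_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return datetime.strptime(match.group(1), fmt).date()
    return None
```

A title heuristic (e.g. first non-empty line of the first page) could live alongside this, but is more fragile and layout-dependent.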
Take approach from `classification` notebook and integrate into interpreter for classification and filtering articles.
During scraping, can we tag whether something is text/video/image/pdf? Extra dessert if you can discern between news, blog, etc.
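One hedged way to do the media-type tagging: map the HTTP `Content-Type` header (which a scraper could fetch, e.g. via a HEAD request) to a coarse tag, falling back to guessing from the URL extension. The tag names and fallback behaviour here are assumptions:

```python
import mimetypes

def tag_media_type(url, content_type=None):
    """Return one of 'pdf', 'text', 'video', 'image', or 'unknown'.

    content_type is the raw HTTP Content-Type header if available;
    otherwise we guess from the URL's extension.
    """
    if content_type:
        # Strip parameters like '; charset=utf-8'
        content_type = content_type.split(';')[0].strip().lower()
    else:
        guessed, _ = mimetypes.guess_type(url)
        content_type = (guessed or '').lower()
    if content_type == 'application/pdf':
        return 'pdf'
    for prefix, tag in (('text/', 'text'), ('video/', 'video'),
                        ('image/', 'image')):
        if content_type.startswith(prefix):
            return tag
    return 'unknown'
```

Discriminating news vs. blog is a harder classification problem (domain lists, page structure) and probably belongs with the `classification` work rather than in the scraper.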
This code in `master` breaks production:

```
// if not using docker
// create a pgConfig.js file in the same directory and put your credentials there
const connectionObj = require('./pgConfig');
```
...
The `docker-compose.yml` and `docker.env` files are currently set up with local development in mind. We'll want a production-friendly config:
- Don't run localdb
- DB config refers to AWS RDS...
Write a function that calculates the percentage of missing fields in `report.Report` after an article has been interpreted. We may expand this later to include weighting or other factors. Discussion...
Here's a sketch of an infrastructure plan: ## Development Scrapers run locally (on developer machine) in Docker for prototyping (internal-displacement repo) Write to local DB in docker Can read scrape...
In `Pipeline.process_url` we make multiple calls to `article.update_status()`. The update_status method may raise `UnexpectedArticleStatusException` if it appears that the status has been changed in the meantime. `process_url` should be prepared...
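One way `process_url` might guard those calls is to funnel them through a small wrapper that catches the exception and signals that another worker has taken over the article. The class body and skip-on-conflict policy below are assumptions, not the repo's actual implementation:

```python
class UnexpectedArticleStatusException(Exception):
    """Raised by article.update_status() when the stored status has
    changed since we last read it (stand-in definition for this sketch)."""

def safe_update_status(article, new_status):
    """Attempt a status update; return False if the article's status
    was changed in the meantime (i.e. we should stop processing it)."""
    try:
        article.update_status(new_status)
        return True
    except UnexpectedArticleStatusException:
        # Another process appears to own this article now -- back off
        # rather than clobbering its state.
        return False
```

`process_url` could then bail out early whenever `safe_update_status` returns `False`, instead of letting the exception propagate mid-pipeline.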
Make sure the pipeline is working with PDF articles for different scenarios:
- Non-existent / broken URL
- Non-English
- Irrelevant
- Relevant

Ideally include some tests in `tests/test_Pipeline.py`.
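The four scenarios could look roughly like this in `tests/test_Pipeline.py`. `FakePipeline`, the URLs, and the status strings are all placeholders (assumptions) so the sketch is runnable; the real tests would exercise the actual `Pipeline` with fixture PDFs:

```python
class FakePipeline:
    """Stand-in for the real Pipeline so these scenarios run on their own."""
    def process_url(self, url):
        if 'broken' in url:
            return 'fetching failed'
        if 'french' in url:
            return 'language unsupported'
        if 'sports' in url:
            return 'irrelevant'
        return 'processed'

def test_broken_url():
    assert FakePipeline().process_url('http://example.com/broken.pdf') == 'fetching failed'

def test_non_english():
    assert FakePipeline().process_url('http://example.com/french-article.pdf') == 'language unsupported'

def test_irrelevant():
    assert FakePipeline().process_url('http://example.com/sports-story.pdf') == 'irrelevant'

def test_relevant():
    assert FakePipeline().process_url('http://example.com/displacement-report.pdf') == 'processed'
```

Plain `test_*` functions like these are picked up by pytest without any imports; parametrising over (url, expected_status) pairs would keep it compact as scenarios grow.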
Write a function in `article.Article` that calculates the percentage of scraped fields which are returned empty. We may consider expanding the definition of scraping reliability later, so suggestions welcome.