internal-displacement
Python process to check for new URLs and run the pipeline on them
We would like the front end to be able to submit new URLs for processing by writing an article row into the DB with a status of NEW. We need a back-end process that looks for such rows and kicks off the scraping & interpretation pipeline.
Because it takes a while to bring up the interpretation environment (loading dependencies & model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
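A minimal sketch of such a poller, assuming an articles table with url and status columns; the schema, status values, and function names here are illustrative, not the project's actual API:

```python
import sqlite3
import time

POLL_INTERVAL_SECONDS = 60  # how long the worker sleeps between checks (configurable)


def process_new_articles(conn, pipeline):
    """Run `pipeline` on every article whose status is NEW.

    Marks each row as FETCHED afterwards so it is not picked up again.
    Returns the list of URLs processed in this pass.
    """
    rows = conn.execute(
        "SELECT url FROM articles WHERE status = 'NEW'"
    ).fetchall()
    processed = []
    for (url,) in rows:
        pipeline(url)  # scraping & interpretation would happen here
        conn.execute(
            "UPDATE articles SET status = 'FETCHED' WHERE url = ?", (url,)
        )
        processed.append(url)
    conn.commit()
    return processed


def run_worker(conn, pipeline):
    """Long-running loop: wake up, process any new rows, go back to sleep."""
    while True:
        process_new_articles(conn, pipeline)
        time.sleep(POLL_INTERVAL_SECONDS)
```

Keeping the loop in one long-lived process means the heavy model load happens once at startup rather than per URL.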
I love this idea. I also think it would be a good idea to split the Pipeline into two separate parts: URL Parsing & Classification / Report Extraction.
Perhaps we could then have two processes: one that looks for URLs with status NEW and executes the scraping code, and another that looks for URLs with status FETCHED and executes the remaining Classification / Report Extraction piece.
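The two-process split could share one generic polling pass, parameterized by the status it consumes and the status it writes back. The statuses, schema, and step functions below are assumptions for illustration:

```python
import sqlite3


def make_stage(conn, from_status, to_status, step):
    """Build a single polling pass for one pipeline stage.

    Rows currently in `from_status` are run through `step` and then
    advanced to `to_status`. Each worker process would call its pass
    inside its own sleep/wake loop.
    """
    def run_once():
        rows = conn.execute(
            "SELECT url FROM articles WHERE status = ?", (from_status,)
        ).fetchall()
        for (url,) in rows:
            step(url)
            conn.execute(
                "UPDATE articles SET status = ? WHERE url = ?",
                (to_status, url),
            )
        conn.commit()
        return [url for (url,) in rows]

    return run_once
```

The scraper worker would then be `make_stage(conn, "NEW", "FETCHED", scrape)` and the classifier worker `make_stage(conn, "FETCHED", "PROCESSED", classify)`, so the status column acts as the hand-off queue between the two processes.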