
Python process to check for new URLs and run the pipeline on them

Open WanderingStar opened this issue 7 years ago • 1 comments

We would like the front end to be able to submit new URLs for processing by writing an article row into the DB with a status of NEW. We need a process running on the back end that looks for such rows and kicks off the scraping & interpretation pipeline.

Because it takes a while to bring up the interpretation environment (loading dependencies & model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
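A minimal sketch of such a long-running poller, assuming a table named `article` with `id`, `url`, and `status` columns (the table and column names, and the `process_article` stub, are assumptions for illustration; sqlite3 stands in for whatever DB the project actually uses):

```python
import sqlite3
import time

POLL_INTERVAL = 60  # seconds between polls; intended to be configurable


def fetch_new_articles(conn):
    """Return (id, url) pairs for rows still marked NEW."""
    cur = conn.execute("SELECT id, url FROM article WHERE status = 'NEW'")
    return cur.fetchall()


def process_article(conn, article_id, url):
    """Stub for the scraping & interpretation pipeline; marks the row DONE."""
    conn.execute("UPDATE article SET status = 'DONE' WHERE id = ?", (article_id,))
    conn.commit()


def poll_forever(conn):
    """Load heavy dependencies once, then sleep and wake to check for work."""
    while True:
        for article_id, url in fetch_new_articles(conn):
            process_article(conn, article_id, url)
        time.sleep(POLL_INTERVAL)
```

Keeping the loop in one long-lived process means the expensive model and dependency loading happens only once, at startup, rather than on every new URL.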

WanderingStar avatar Apr 02 '17 19:04 WanderingStar

I love this idea. I also think it would be a good idea to split the Pipeline into two separate parts: URL Parsing & Classification / Report Extraction.

Perhaps we could then have two processes: one that looks for URLs with status "New" and executes the scraping code, and another that looks for URLs with status "Fetched" and executes the remaining Classification / Report Extraction piece.
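The two-worker split could share one generic "claim and advance" step, with each worker watching a different status. A sketch under the same assumptions as before (hypothetical `article` table; `NEW` → `FETCHED` → `PROCESSED` status names and the `work` stubs are placeholders, not the project's actual values):

```python
import sqlite3


def claim_and_advance(conn, from_status, to_status, work):
    """Find rows in from_status, run work(url) on each, then move them to to_status."""
    rows = conn.execute(
        "SELECT id, url FROM article WHERE status = ?", (from_status,)
    ).fetchall()
    for article_id, url in rows:
        work(url)
        conn.execute(
            "UPDATE article SET status = ? WHERE id = ?", (to_status, article_id)
        )
    conn.commit()
    return len(rows)


def scrape_worker(conn):
    # First process: scraping; replace the lambda with the real fetch code.
    return claim_and_advance(conn, "NEW", "FETCHED", lambda url: None)


def extract_worker(conn):
    # Second process: classification / report extraction.
    return claim_and_advance(conn, "FETCHED", "PROCESSED", lambda url: None)
```

One advantage of this split is that the lightweight scraper can restart cheaply, while only the extraction worker pays the cost of loading the model.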

simonb83 avatar May 17 '17 23:05 simonb83