internal-displacement
Python process to check for new URLs and run the pipeline on them
We would like the front end to be able to submit new URLs for processing by writing an article row into the DB with a status of NEW. We need a back-end process that looks for such rows and kicks off the scraping & interpretation pipeline.
Because it takes a while to bring up the interpretation environment (loading dependencies & model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
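A minimal sketch of such a poller, assuming an articles table with url and status columns; the schema, status values, and function names here are illustrative, not the project's actual API:

```python
import sqlite3
import time

POLL_INTERVAL_SECONDS = 60  # how long the worker sleeps between checks (configurable)


def process_new_articles(conn, pipeline):
    """Run `pipeline` on every article whose status is NEW.

    Marks each row as FETCHED afterwards so it is not picked up again.
    Returns the list of URLs processed in this pass.
    """
    rows = conn.execute(
        "SELECT url FROM articles WHERE status = 'NEW'"
    ).fetchall()
    processed = []
    for (url,) in rows:
        pipeline(url)  # scraping & interpretation would happen here
        conn.execute(
            "UPDATE articles SET status = 'FETCHED' WHERE url = ?", (url,)
        )
        processed.append(url)
    conn.commit()
    return processed


def run_worker(conn, pipeline):
    """Long-running loop: wake up, process any new rows, go back to sleep."""
    while True:
        process_new_articles(conn, pipeline)
        time.sleep(POLL_INTERVAL_SECONDS)
```

Keeping the loop in one long-lived process means the heavy model load happens once at startup rather than per URL.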
I love this idea. I also think it would be a good idea to split the Pipeline into two separate parts: URL Parsing & Classification / Report Extraction.
Perhaps we could then have two processes: one that looks for URLs with status NEW and executes the scraping code, and another that looks for URLs with status FETCHED and executes the remaining Classification / Report Extraction piece.
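The two-process split could share one generic polling pass, parameterized by the status it consumes and the status it writes back. The statuses, schema, and step functions below are assumptions for illustration:

```python
import sqlite3


def make_stage(conn, from_status, to_status, step):
    """Build a single polling pass for one pipeline stage.

    Rows currently in `from_status` are run through `step` and then
    advanced to `to_status`. Each worker process would call its pass
    inside its own sleep/wake loop.
    """
    def run_once():
        rows = conn.execute(
            "SELECT url FROM articles WHERE status = ?", (from_status,)
        ).fetchall()
        for (url,) in rows:
            step(url)
            conn.execute(
                "UPDATE articles SET status = ? WHERE url = ?",
                (to_status, url),
            )
        conn.commit()
        return [url for (url,) in rows]

    return run_once
```

The scraper worker would then be `make_stage(conn, "NEW", "FETCHED", scrape)` and the classifier worker `make_stage(conn, "FETCHED", "PROCESSED", classify)`, so the status column acts as the hand-off queue between the two processes.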