Add executor to zimscraperlib
This PR enriches scraperlib with a `ScraperExecutor`. This class can process tasks in parallel, with a given number of worker threads.
This executor is mainly inspired by the sotoki executor, although other executors can be found in wikihow and in iFixit. The wikihow one seems more primitive / older, and the iFixit one is just a pale copy.
For easy review, the first commit is simply a copy/paste of the sotoki code, and the following commits are the adaptations / enhancements for scraperlib.
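For reviewers less familiar with the pattern, here is a minimal sketch of a queue-backed executor with worker threads. It is not the code in this PR: `MiniExecutor`, `nb_workers`, `queue_size` and the error raised by `submit` are hypothetical names and choices made for illustration only.

```python
import queue
import threading


class MiniExecutor:
    """Hypothetical, simplified stand-in; NOT the ScraperExecutor from this PR."""

    def __init__(self, nb_workers: int = 2, queue_size: int = 10):
        self._tasks: queue.Queue = queue.Queue(maxsize=queue_size)
        self._no_more = threading.Event()
        self._workers = [
            threading.Thread(target=self._worker, name=f"executor-{n}", daemon=True)
            for n in range(nb_workers)
        ]

    def start(self) -> None:
        for worker in self._workers:
            worker.start()

    def submit(self, task, **kwargs) -> None:
        # Refuse new work once join() has been requested.
        if self._no_more.is_set():
            raise RuntimeError("executor is not accepting tasks anymore")
        self._tasks.put((task, kwargs))  # blocks while the queue is full

    def _worker(self) -> None:
        while True:
            try:
                task, kwargs = self._tasks.get(timeout=0.1)
            except queue.Empty:
                if self._no_more.is_set():
                    return  # queue drained and no further task expected
                continue
            try:
                task(**kwargs)  # an exception here ends this worker (sketch only)
            finally:
                self._tasks.task_done()

    def join(self) -> None:
        # Stop accepting tasks, then wait for the workers to drain the queue.
        self._no_more.set()
        for worker in self._workers:
            worker.join()


results = []
results_lock = threading.Lock()


def square(value: int) -> None:
    with results_lock:
        results.append(value * value)


executor = MiniExecutor(nb_workers=3)
executor.start()
for value in range(5):
    executor.submit(square, value=value)
executor.join()
print(sorted(results))
```

The bounded queue gives back-pressure: `submit` blocks when workers cannot keep up, instead of letting pending tasks pile up in memory.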
What has been changed compared to sotoki code:
- commit https://github.com/openzim/python-scraperlib/pull/211/commits/ae8edb74e7c8e27d6beba36a96d43786670df5b1:
  - automated unit tests, obviously
  - moved `thread_deadline_sec` to the executor, should we need to customize it per executor (probably the case, priceless and useful for tests at least)
  - added a check `if self.no_more:` in the `submit` method: this allows the executor to stop accepting tasks even when it is just `joined` and not `shutdown`
  - renamed `prefix` to `executor_name` and moved from `T-` to `executor` (way clearer in the logs, from my experience)
  - removed the `release_halt` method, which was misleading / not working (I failed to `join`, then `release_halt`, then `submit` again ... it seems mandatory to `join`, then `start` (again), then `submit`)
- commit https://github.com/openzim/python-scraperlib/pull/211/commits/fd5c04ab3e675224aa67daf5d7326bbc8d72f659
  - changes in the `join` method: in sotoki, the executor waits `thread_deadline_sec` seconds per thread. This is highly unpredictable when there are many workers (we could wait `thread_deadline_sec` for the first worker, then `thread_deadline_sec` for the second worker, etc.), and it is a bit weird that the last worker in the list has way more time to complete than the first one
  - the new method computes a global deadline for all threads to join, and immediately requests all of them to join (should they already be ready to join)
- commit https://github.com/openzim/python-scraperlib/pull/211/commits/0ce636cb825155ce839c63924630ebaf594eda02
  - just a standard log format displaying the thread name, useful to keep the same notation / format in all scrapers whenever we want to display the thread name (should be quite common in fact)
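To make the `join` change concrete, here is a rough sketch of joining threads against one shared deadline rather than granting each thread its own timeout. It is an illustration under assumptions, not the implementation in this PR: `join_all`, `deadline_sec` and the log format string are hypothetical; only `Thread.join(timeout=...)` and the `%(threadName)s` log record attribute are standard library features.

```python
import logging
import threading
import time

# Hypothetical format: it only shows that the thread name can be part of every log line.
logging.basicConfig(
    format="[%(threadName)s::%(asctime)s] %(levelname)s:%(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("executor-sketch")


def join_all(threads, deadline_sec: float) -> list:
    """Join all threads against a single deadline; return those still alive."""
    deadline = time.monotonic() + deadline_sec
    for thread in threads:
        # Each join only gets what is left of the shared budget (possibly 0).
        remaining = max(0.0, deadline - time.monotonic())
        thread.join(timeout=remaining)
    return [thread for thread in threads if thread.is_alive()]


release = threading.Event()


def quick() -> None:
    logger.info("task done")


def stuck() -> None:
    release.wait()  # blocks until the demo releases it


threads = [
    threading.Thread(target=stuck, name="executor-0", daemon=True),
    threading.Thread(target=quick, name="executor-1", daemon=True),
    threading.Thread(target=stuck, name="executor-2", daemon=True),
]
for thread in threads:
    thread.start()

started = time.monotonic()
still_alive = join_all(threads, deadline_sec=0.5)
elapsed = time.monotonic() - started
logger.info("threads still alive after deadline: %s", [t.name for t in still_alive])

release.set()  # let the blocked demo threads finish
```

With a per-thread timeout, the two blocked threads above would cost two full timeouts in a row; with a shared deadline the total wait is bounded by `deadline_sec`, whatever the number of workers.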
This executor will be used right away in the mindtouch scraper.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 100.00%. Comparing base (ea6505f) to head (0ce636c).
Additional details and impacted files
@@ Coverage Diff @@
## main #211 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 38 39 +1
Lines 2221 2327 +106
Branches 426 446 +20
==========================================
+ Hits 2221 2327 +106
Converting to draft, we are experimenting with joblib in mindtouch scraper for now