
Add executor to zimscraperlib

Open benoit74 opened this issue 1 year ago • 2 comments

This PR enriches scraperlib with a ScraperExecutor, a class that processes tasks in parallel with a given number of worker threads.
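To make the idea concrete, here is a minimal sketch of a queue-fed pool of worker threads. It is not the code of this PR: the class name TinyExecutor, its methods, and the errors list are hypothetical and only illustrate the general technique. Only the no_more flag name is borrowed from the description.

```python
import queue
import threading


class TinyExecutor:
    """Hypothetical stand-in, NOT the ScraperExecutor from this PR."""

    def __init__(self, nb_workers, queue_size=10):
        self.tasks = queue.Queue(maxsize=queue_size)
        self.no_more = False  # set once the executor stops accepting tasks
        self.errors = []  # exceptions raised by tasks, kept for inspection
        self.workers = [
            threading.Thread(target=self._worker, name=f"executor-{n}", daemon=True)
            for n in range(nb_workers)
        ]

    def start(self):
        for worker in self.workers:
            worker.start()

    def submit(self, func, *args):
        # Refuse new work as soon as join() has been requested.
        if self.no_more:
            raise RuntimeError("executor no longer accepts tasks")
        self.tasks.put((func, args))  # blocks while the queue is full

    def _worker(self):
        while True:
            item = self.tasks.get()
            if item is None:  # sentinel: this worker is asked to stop
                break
            func, args = item
            try:
                func(*args)
            except Exception as exc:  # keep the worker alive on task failure
                self.errors.append(exc)

    def join(self):
        self.no_more = True
        # One sentinel per worker, queued behind all pending tasks.
        for _ in self.workers:
            self.tasks.put(None)
        for worker in self.workers:
            worker.join()
```

Typical use under these assumptions would be start(), a series of submit() calls, then join(). The bounded queue makes submit() block when workers fall behind, which keeps memory usage flat.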

This executor is mainly inspired by the sotoki executor, although other executors exist in wikihow and iFixit. The wikihow one seems more primitive / older, and the iFixit one is just a pale copy.

For easier review, the first commit is simply a copy/paste of the sotoki code, and the following commits are the adaptations / enhancements for scraperlib.

What has changed compared to the sotoki code:

  • commit https://github.com/openzim/python-scraperlib/pull/211/commits/ae8edb74e7c8e27d6beba36a96d43786670df5b1:
    • automated unit tests, obviously
    • moved thread_deadline_sec to the executor, in case we need to customize it per executor (probably the case; priceless and useful for tests at least)
    • added an if self.no_more: check in the submit method: this allows the executor to stop accepting tasks even when it has only been joined and not shut down
    • renamed prefix to executor_name and switched from T- to executor (much clearer in the logs, from my experience)
    • removed the release_halt method, which was misleading / not working (I failed to join, then release_halt, then submit again; it seems mandatory to join, then start (again), then submit)
  • commit https://github.com/openzim/python-scraperlib/pull/211/commits/fd5c04ab3e675224aa67daf5d7326bbc8d72f659
    • changes in the join method: in sotoki, the executor waits thread_deadline_sec seconds per thread. This is highly unpredictable when there are many workers (we could wait thread_deadline_sec for the first worker, then thread_deadline_sec for the second worker, etc.), and it is a bit weird that the last worker in the list has far more time to complete than the first one
    • the new method computes a global deadline for all threads to join, and immediately requests all of them to join (in case they are already ready to join)
  • commit https://github.com/openzim/python-scraperlib/pull/211/commits/0ce636cb825155ce839c63924630ebaf594eda02
    • just a standard log displaying the thread name, useful for keeping the same notation / format in all scrapers whenever we want to display the thread name (which should be quite common in fact)
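
The difference between a per-thread wait and a global deadline can be sketched as follows. This is an illustration of the technique only, not the code of the commit: the function name join_with_global_deadline and its signature are hypothetical, and deadline_sec plays the role of thread_deadline_sec.

```python
import threading
import time


def join_with_global_deadline(threads, deadline_sec):
    """Join all threads within one shared time budget.

    Hypothetical helper: returns the threads still alive once the
    budget is spent, so the caller can decide what to do with them.
    """
    deadline = time.monotonic() + deadline_sec
    for thread in threads:
        # Each join only gets what is left of the shared budget, so the
        # total wait is bounded by deadline_sec rather than by
        # deadline_sec * len(threads).
        remaining = max(0.0, deadline - time.monotonic())
        # A zero timeout still returns at once for finished threads.
        thread.join(timeout=remaining)
    return [thread for thread in threads if thread.is_alive()]
```

With a per-thread timeout, a slow first worker would use up its own deadline_sec and every later worker would then receive a fresh one. With the shared deadline, a slow first worker uses up the whole budget, and later workers are only checked for completion.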

This executor will be used right away in the mindtouch scraper.

benoit74 avatar Nov 05 '24 13:11 benoit74

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (ea6505f) to head (0ce636c).

Additional details and impacted files
@@            Coverage Diff             @@
##              main      #211    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           38        39     +1     
  Lines         2221      2327   +106     
  Branches       426       446    +20     
==========================================
+ Hits          2221      2327   +106     

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Nov 05 '24 13:11 codecov[bot]

Converting to draft: we are experimenting with joblib in the mindtouch scraper for now.

benoit74 avatar Nov 08 '24 13:11 benoit74