mwoffliner
Put scraping logic somewhere outside the main loop (option 4)
Related to #1043
(option 4) We're trying to gather the clues for #1043. Here is my guess for another approach.
The idea is to extract the logic that actually performs the scraping and put it somewhere outside the main Node loop — another container, or just a Node worker thread. Call it with the articleId and get the result back: the wiki article (the JSON plus all its dependencies and files). That worker could produce the wiki article as a single directory with a few files, located, for example, on a small storage area shared between the workers and the main process. I'm sure we couldn't share an open ZIM file between Node processes, so the main process would watch that storage and, once it detects a new directory (named by articleId, for example) containing the article content plus a semaphore file like `.done`, it would pick up that folder as a wiki article, put the files into the open ZIM, and remove the folder from the storage. Because the storage only holds articles in transit, it could stay small — small enough to live in RAM, why not.
Next, we consider the main process as an orchestrator that:
- prepares everything;
- maintains the article list in redis;
- calls the workers with a fixed set of articles to grab (10-20, as defined);
- tracks what's done and keeps the list of remaining articles in redis;
- continues to call workers accordingly;
- kills a worker if it has produced no new data within a time window;
- picks up the results as described above and puts them into the ZIM;
- finalizes everything.
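Two pieces of that loop can be sketched in isolation: handing out fixed-size batches from the remaining-articles list, and the no-new-data watchdog. This is only an illustration — a plain array stands in for the redis list, `BATCH_SIZE` and the `worker.kill()` interface are assumptions:

```javascript
const BATCH_SIZE = 15; // "10-20 as defined"

// Take the next batch of articleIds for a worker; the remaining list
// shrinks as batches are handed out, which is what the orchestrator tracks.
function nextBatch(remaining, size = BATCH_SIZE) {
  return remaining.splice(0, size);
}

// Watchdog: kill a worker that produces no new data within the time window.
// "worker" is any object with a kill() method (hypothetical interface).
function armWatchdog(worker, windowMs) {
  let timer = setTimeout(() => worker.kill(), windowMs);
  return {
    // Call this whenever the worker reports progress, to reset the window.
    progress() {
      clearTimeout(timer);
      timer = setTimeout(() => worker.kill(), windowMs);
    },
    disarm() {
      clearTimeout(timer);
    },
  };
}
```

In the real orchestrator, `nextBatch` would be a redis `LPOP`/`LRANGE` against the article list, so a crashed orchestrator could resume from where it left off.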
The KEY differences here are:
- we don't need to care about worker lifetime, because workers are isolated and stateless;
- the main process becomes a single point of responsibility that doesn't perform the actual complex work, so we can assume it won't get stuck :)
Any thoughts?
@midik I don't understand which problem this should fix.
These are the same intentions as in option 3, just a somewhat more generalized approach.
@midik I don’t want to build a dependency on Docker into MWoffliner, and I don’t want to make things far more complicated for users to run. Using different processes and letting them deal with each other is not always easy/fast.
That said, I believe, like you, that it is not a good architecture to have so many CPU-intensive things within the main loop. This does not scale properly even on a nice async system like Node.js.
What do you think about offloading image and HTML processing, before writing to the ZIM, to workers? See https://nodejs.org/api/worker_threads.html. I think this might be a nice path to follow.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.