Remarks about RedisKVS.iterateItems()
Looking around in the MWoffliner code, I came across `RedisKvs.iterateItems()`: https://github.com/openzim/mwoffliner/blob/master/src/util/RedisKvs.ts#L165
If I understand properly, this function is at the core of the iteration over all articles/images. It has a `while (true)` loop... but no log entry at all! It sounds to me like a good candidate for where things could fail.
So here are a few remarks/proposals:
- It should be a `do... while()` test, not a `while`.
- `depleted` is useless; we should just return/exit (if necessary at all, which it should not be).
- The `reject(err)` of `RedisKvs.scan()` seems not handled at all!
- I don't really get the split between `RedisKvs.scan()` and `pendingScan: Promise<ScanResult>`. To me they should be merged into one method, and the only thing shared should be the `scanCursor`.
- This line `const items = Array.from(data, (x, k) => k % 2 ? undefined : [x, data[k + 1]]).filter((x) => x);` is a mystery to me and I wonder what it does.
- We should not `break` IMO; the `while` loop should be exited properly through its own checks if necessary.
- The line `scanCursor = cursor` creates the conditions for a problem by resetting `scanCursor` late; it might have been changed in the meantime by another async method.
- There is no safeguard for the fact that the same Redis element can be returned many times.

@midik Maybe you have more to say/change on this. I have the strong feeling that this iterator is not done properly. For illustration, the kind of loop I would expect is sketched below.
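For illustration only, here is roughly the kind of loop I would expect, sketched against ioredis (the function and variable names are mine, not the existing RedisKvs code):

```typescript
import Redis from 'ioredis';

// Iterate over every field/value pair of a Redis hash with a single,
// serialised HSCAN loop: no `while (true)`, no `break`, and no shared
// cursor that another async call could overwrite in the meantime.
async function iterateHashItems(
  redis: Redis,
  hashKey: string,
  handler: (field: string, value: string) => Promise<void>,
): Promise<void> {
  let cursor = '0';
  do {
    // HSCAN resolves to [nextCursor, [field1, value1, field2, value2, ...]];
    // a rejection propagates to the caller instead of being silently lost.
    const [nextCursor, flat] = await redis.hscan(hashKey, cursor, 'COUNT', 100);
    cursor = nextCursor;
    for (let i = 0; i < flat.length; i += 2) {
      await handler(flat[i], flat[i + 1]);
    }
    // Beware: Redis may return the same field more than once during a scan,
    // so the handler (or its caller) has to tolerate duplicates.
  } while (cursor !== '0'); // Redis signals the end of the scan with cursor '0'
}
```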
@midik Thinking twice about it, I think the `pmap` as it is does not make sense at all, because the SCAN requests to Redis are (or should be) serialised anyway.
The whole thing should work with a queue (an `async.queue`?); it would do the job properly, I believe. This is partly how it was working before (look at bc90a3b).
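For illustration, the shape I have in mind could look like this (a sketch assuming ioredis and the `async` package; none of these names come from the existing code, and error handling of failed tasks is left out):

```typescript
import Redis from 'ioredis';
import { queue } from 'async';

// Read from Redis serially with HSCAN, but process the scanned items in
// parallel through a bounded async.queue.
async function scanThroughQueue(
  redis: Redis,
  hashKey: string,
  handler: (field: string, value: string) => Promise<void>,
  concurrency = 8,
): Promise<void> {
  const q = queue<{ field: string; value: string }>(
    async (item) => handler(item.field, item.value),
    concurrency,
  );

  let cursor = '0';
  do {
    const [nextCursor, flat] = await redis.hscan(hashKey, cursor, 'COUNT', 200);
    cursor = nextCursor;
    for (let i = 0; i < flat.length; i += 2) {
      q.push({ field: flat[i], value: flat[i + 1] });
    }
  } while (cursor !== '0');

  // Wait for the workers to finish everything that was queued.
  if (!q.idle()) {
    await q.drain();
  }
}
```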
The pointed-out nondeterminism from this is what makes me sweat a little; it's entirely possible that different iterations of the same query/scrape job would result in missing data or uncaught edge cases... thanks for catching this!
@ShadowJonathan Thank you for confirming my first impression, as I'm not really a JavaScript developer. Would you be able to make a PR to fix it?
I can't find time to look at this soon, and I'm also not familiar with the codebase, so collaborators could look at this better than I can right now.
@midik Any news here?
@kelson42 I'm chiming in to agree with the approach of reading from Redis serially, then processing the chunks in parallel.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
@kelson42 I've got a bit of experience dealing with Redis, so please don't mind me abusing this issue for some general questions :)
Most will agree that `.iterateItems()` is not the ideal solution.
And IMHO https://github.com/openzim/mwoffliner/blob/main/src/util/rewriteUrls.ts#L16 looks even worse.
So for every Wikilink we ask Redis whether we got that article. And we sequentially await every single request. And (correct me if I'm wrong) we repeat that for every link, even if the same link already appeared earlier in the same article.
It's possible to improve that specifically in multiple ways. The best ones I can think of are to collect the links within an article and then request them all in one go in a pipeline, or even use Redis' Lua capability to write a script able to request them all in one go (Redis' Lua capability is amazing).
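Something along these lines for the pipelined variant, purely as a sketch (the hash name and helper are hypothetical, not mwoffliner's actual data layout):

```typescript
import Redis from 'ioredis';

// Check a whole batch of linked titles against the article hash in a single
// pipelined round-trip, instead of awaiting one request per link.
async function filterMirroredTitles(
  redis: Redis,
  articleDetailKey: string, // hypothetical name of the hash holding article details
  linkedTitles: string[],
): Promise<Set<string>> {
  const distinct = [...new Set(linkedTitles)]; // skip links repeated within the article
  const pipeline = redis.pipeline();
  for (const title of distinct) {
    pipeline.hexists(articleDetailKey, title);
  }
  // exec() resolves to one [error, reply] pair per queued command, in order.
  const replies = await pipeline.exec();
  const mirrored = new Set<string>();
  replies?.forEach(([err, exists], i) => {
    if (!err && exists === 1) {
      mirrored.add(distinct[i]);
    }
  });
  return mirrored;
}
```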
But with #1705 this is now the third occasion that makes me wonder if the general order mwoffliner operates in is ideal.
Currently we do:
- fetch article details in batches, save all of it into redis
- fetch articles, save list of required media into redis and ask for every link
- fetch media according to media list in redis
But why don't we do:
- fetch article details in batches -> for every batch, download the articles and media immediately -> assume that all article links are working -> save list of titles we link to in redis (don't save article details) -> write to a temporary file (maybe a zim file itself, ramdisk, redis if you have enough RAM, whatever) -> collect 404s and other issues
- do that in parallel for multiple batches
- curate a list of articles that need re-fetching/parsing based on linked titles and 404s
- go through the temporary file and directly copy every article that is OK into the final zim, otherwise fetch/parse them again
Some other benefits:
- when I create a zimfile, the bottleneck in the media-download phase is the conversion to webp or optimization. CPU load would be distributed better.
- getting media URLs is less ugly and can happen based solely on Content-Type rather than hoping that there's an extension
> @kelson42 I've got a bit of experience dealing with Redis, so please don't mind me abusing this issue for some general questions :)
Not at all, please enlighten me!
I'm pretty sure the problem around large iterations over Redis keys is the most problematic one we currently have in MWoffliner, because:
- Looking at the code, it looks "not convincing" to me
- We have more and more scrapes freezing at 100% of such iterations
- The problem is not that trivial to fix, and a few devs have already failed to fix it
> Most will agree that `.iterateItems()` is not the ideal solution. And IMHO https://github.com/openzim/mwoffliner/blob/main/src/util/rewriteUrls.ts#L16 looks even worse.
Yes, but I'm pretty sure `rewriteUrls` works (even if the code is not stylish)... but `iterateItems` is buggy. In both cases any improvement proposal is super welcome. I would recommend a discussion before coding anything big.
> So for every Wikilink we ask Redis whether we got that article. And we sequentially await every single request. And (correct me if I'm wrong) we repeat that for every link, even if the same link already appeared earlier in the same article. It's possible to improve that specifically in multiple ways. The best ones I can think of are to collect the links within an article and then request them all in one go in a pipeline, or even use Redis' Lua capability to write a script able to request them all in one go (Redis' Lua capability is amazing).
I haven't studied the code again, but I believe this is how it works. I agree this is slow/inefficient... but I believe it works correctly (which is not the least important thing ;)).
> But with #1705 this is now the third occasion that makes me wonder if the general order mwoffliner operates in is ideal.
Gathering all the links within an article to bundle the request to Redis, and then throwing away the list, sounds like a good idea. This has no impact on memory, contrary to keeping the mapping of all the links of all articles.
> Currently we do:
> 1. fetch article details in batches, save all of it into redis
> 2. fetch articles, save list of required media into redis and ask for every link
> 3. fetch media according to media list in redis
I confirm
> But why don't we do:
> 1. fetch article details in batches -> for every batch, download the articles and media immediately -> assume that all article links are working -> save list of titles we link to in redis (don't save article details) -> write to a temporary file (maybe a zim file itself, ramdisk, redis if you have enough RAM, whatever) -> collect 404s and other issues
One point about pictures: we do them at the end because we keep only one version of an image, even if that image is used in many resolutions. We put only the highest resolution in the cache/ZIM. So you need to reach the end of the article scrape to know, for each image, what the highest resolution is by the time you start to write images into the ZIM.
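To illustrate the constraint (not actual mwoffliner code, just the idea):

```typescript
// While scraping articles, only the largest resolution requested so far is
// recorded per image; the image itself can only be fetched once every
// article that uses it has been processed.
const largestWidthByImage = new Map<string, number>();

function recordImageUse(imageName: string, width: number): void {
  const seen = largestWidthByImage.get(imageName) ?? 0;
  if (width > seen) {
    largestWidthByImage.set(imageName, width);
  }
}
```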
In general, and in scraping too, what is slow is I/O. MWoffliner is written to reduce I/O on the filesystem. In the past MWoffliner had far more I/O on the filesystem (with a lot of temporary data) and this was significantly slower than today. We also had multiple challenges handling extremely large numbers of files. Current MWoffliner also works without a highly complex infrastructure (or the need for a lot of memory or filesystem space); this is something we have put effort into (and need to take care of).
I don't really get how this is related to `RedisKvs.iterateItems()`... or am I missing a point? You might need fewer such iterations, but you will still need this function.
> 2. do that in parallel for multiple batches
Doing things in parallel does not mean it will go faster... in particular if your limiting factor is the remote MediaWiki backend, which is the case for us most of the time. We don't have a speed problem, we have a stability problem.
> 3. curate a list of articles that need re-fetching/parsing based on linked titles and 404s
> 4. go through the temporary file and directly copy every article that is OK into the final zim, otherwise fetch/parse them again
Here it is not clear what problem you want to fix; I don't see the relationship with this ticket.
> Some other benefits:
> * when I create a zimfile, the bottleneck in the media-download phase is the conversion to webp or optimization. CPU load would be distributed better.
Use the S3 cache feature... locally if you have a bandwidth problem. Or don't optimise or convert to webp...
> * getting media URLs is less ugly and can happen based solely on Content-Type rather than hoping that there's an extension
Relying on the URL/filename to identify the image type is indeed a problem (more a stylistic one than a real one, though). We should rely on the response of the HEAD request to know the type. Such a request is currently done anyway during the process of downloading files. Indeed, we could study whether such a HEAD request might be done earlier (in parallel with article scraping), with the result saved to Redis (not sure this is really worth the effort though).
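For reference, the Content-Type based detection itself is simple; a minimal sketch, assuming an axios-style HTTP client (not necessarily what the downloader uses today):

```typescript
import axios from 'axios';

// Derive the image MIME type from the Content-Type header of a HEAD request
// instead of guessing from the file extension.
async function getImageMimeType(url: string): Promise<string | undefined> {
  const response = await axios.head(url);
  // e.g. 'image/webp', 'image/png', 'image/svg+xml'
  const contentType = response.headers['content-type'];
  return typeof contentType === 'string'
    ? contentType.split(';')[0].trim()
    : undefined;
}
```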
@kelson42 I used this issue because it isn't going to be solved in a satisfying way without a large rewrite, and there is no `iterateItems()` if there is no `articleDetailXId`.
If we forget about `--articleList` cases for a moment (performance there isn't relevant anyway):
Why do we need the list of articles (`articleDetailXId`)?
Why would I need to check for every Wikilink whether the article is in the list to fetch, if I am going to fetch all of them anyway?
If the only purpose of that list (in the most performance-intensive task of wikipedia_en_all) is to be able to first fetch all article details before fetching the articles, then I think there is a flaw.
I feel like I missed something; maybe it is the categories and `trimUnmirroredPages` (I didn't look into what exactly that does) that aren't possible any other way.
> @kelson42 I used this issue because it isn't going to be solved in a satisfying way without a large rewrite, and there is no `iterateItems()` if there is no `articleDetailXId`.

So you say: "it's not possible to iterate properly through a large set of keys stored in Redis"?
That is like asking if saving a file as base64 in a SQL table is a good idea. You can do it if you want. It just doesn't look nice, feels wrong and you might be better off asking the question why you want to do it in the first place.
SCAN and HSCAN are the way to do it, if you have to - and that is what you do, so no issue there - and if you are worried about performance, you can read this: https://docs.keydb.dev/blog/2020/08/10/blog-post/
@uriesk OK. I feel this discussion is not really going anywhere. The problem is probably in the description of the ticket, which does not clearly explain the problem to solve.
The focus should be put on fixing the stability of mwoffliner. Too many scrapes just get stuck and freeze without apparent remote backend errors. I guess the problem is around the topic of this ticket.
In what has been proposed, I see a few ideas to slightly speed up the process. Other ideas are not acceptable because they would change the output, need significantly more hardware resources... and I believe would probably even be slower.
I see no reasonable way to avoid storing the list of all articles and all images. Therefore we need a way to iterate through it in a robust manner.
> @kelson42 I used this issue because it isn't going to be solved in a satisfying way without a large rewrite, and there is no `iterateItems()` if there is no `articleDetailXId`.
If confirmed, then Redis is really weak and not the appropriate tool.
> If we forget about `--articleList` cases for a moment (performance there isn't relevant anyway):
> Why do we need the list of articles (`articleDetailXId`)?
At least to know which internal links are valid and which links should be removed.
> Why would I need to check for every Wikilink whether the article is in the list to fetch, if I am going to fetch all of them anyway?
You have a lot of red (dead) links in all wikis anyway. You don't scrape all namespaces anyway... But maybe an optimisation can be done around this.
> If the only purpose of that list (in the most performance-intensive task of wikipedia_en_all) is to be able to first fetch all article details before fetching the articles, then I think there is a flaw.
It's not; it is also there to handle all the redirects, for example.
> I feel like I missed something; maybe it is the categories and `trimUnmirroredPages` (I didn't look into what exactly that does) that aren't possible any other way.
I suspect you miss a few things. That said, a fresh eye on things is an advantage as well.
Considering that you don't know the code base well and that this software is already pretty mature, I would recommend focusing on fixing concrete problems first. Once you are more familiar with the overall code base, it will be easier IMO to run such a discussion.
If you could help to understand why MWoffliner so often gets stuck after scraping all articles (but before scraping images)... that would be really helpful. Bundling the Redis requests to rewrite an article's URLs would be a nice optimisation as well.
I will rewrite `.iterateItems()` without looking at the old code; then you get a second, different implementation with the same functionality that might not cause that issue, if we get lucky.
Do you have an example of a build dying that way?
If I feel like it, I'll do the wikilink `isMirrored` part too.
I thought redirects are in some limbo, because when you `--articleList` a redirected page, it will be saved under the resulting name with no hint that the old name exists... and it's excluded from the Mainpage index because its original name can't be found in the list.
IMHO a redirect should be an actual redirect: an HTML page that forwards you, something like the sketch below.
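Purely as an illustration (the relative-path scheme is hypothetical, not mwoffliner's actual URL layout):

```typescript
// Build a tiny HTML page that forwards the reader to the target article.
function redirectPage(targetTitle: string): string {
  const href = encodeURIComponent(targetTitle) + '.html'; // hypothetical path scheme
  return `<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="refresh" content="0; url=${href}">
    <title>Redirecting...</title>
  </head>
  <body>
    <p>Redirecting to <a href="${href}">${targetTitle}</a>...</p>
  </body>
</html>`;
}
```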
But then I am just proposing even more changes you don't like :) Don't take it personally.
> I will rewrite `.iterateItems()` without looking at the old code; then you get a second, different implementation with the same functionality that might not cause that issue, if we get lucky.
Thank you. Many of the latest failed attempts of https://farm.openzim.org/recipes/wikipedia_en_all_maxi are interesting IMO.
Having a look at this old PR might be instructive as well: https://github.com/openzim/mwoffliner/pull/1245