
Deduplicate ZIM tag values

Open benoit74 opened this issue 1 year ago • 5 comments

When computing the list of tags, it could help to deduplicate them, so that they are not "doubled" by mistake.

https://github.com/openzim/ted/blob/60fb82a127b371907c8d24ba70b4e50d29ff5005/src/ted2zim/scraper.py#L93

benoit74 avatar Apr 19 '24 08:04 benoit74

@benoit74 I'd like to work on this.

One possible solution is to convert the list into a set and back to a list again so that duplicates will be removed.

self.tags = list(set([*self.tags, "_category:ted", "ted", "_videos:yes"]))

WDYT?
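
One caveat with the set round-trip is that a set does not preserve insertion order, so the order of the resulting tags can change. A minimal order-preserving sketch, assuming tags are plain strings (the `dedupe_tags` helper name is hypothetical and not part of ted2zim or scraperlib):

```python
def dedupe_tags(tags):
    """Return tags with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(tags))


# Example: a user-supplied "ted" tag collides with the scraper's default one
user_tags = ["ted", "talks"]
tags = dedupe_tags([*user_tags, "_category:ted", "ted", "_videos:yes"])
print(tags)  # ['ted', 'talks', '_category:ted', '_videos:yes']
```

Keeping the first occurrence means user-supplied tags stay ahead of the defaults, which may or may not matter to readers of the ZIM.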

dan-niles avatar Apr 19 '24 11:04 dan-niles

Should probably be done in scraperlib

rgaudin avatar Apr 19 '24 12:04 rgaudin

> Should probably be done in scraperlib

Agreed, let's transfer the issue.

@dan-niles yes, that's the idea, but it should be done in scraperlib so that it benefits all scrapers. Are you still interested?

benoit74 avatar Apr 19 '24 12:04 benoit74

@benoit74 Sure, I'm up for it. I think we can remove the duplicates inside the config_metadata method in the scraperlib code.

I noticed that some scrapers like ted and youtube use the make_zim_file function from scraperlib, which initializes a Creator object and calls the config_metadata method, whereas warc2zim and kolibri initialize a Creator object and call the config_metadata method directly.

Since all of these scrapers eventually end up calling the config_metadata method, I think if we do the deduplication there, we only have to make the change in one place. What do you think?
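
The single-place idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchCreator` is a hypothetical stand-in rather than scraperlib's actual `Creator`, its `config_metadata` signature is simplified, and accepting tags either as a list or as a `;`-separated string is an assumption.

```python
def normalize_tags(tags):
    """Hypothetical helper: strip, drop empties, dedupe, keep order."""
    if isinstance(tags, str):
        # Assumption: a string holds tags separated by semicolons
        tags = tags.split(";")
    cleaned = (tag.strip() for tag in tags)
    return list(dict.fromkeys(tag for tag in cleaned if tag))


class SketchCreator:
    """Hypothetical stand-in for a ZIM creator; not scraperlib's real class."""

    def __init__(self):
        self.metadata = {}

    def config_metadata(self, **metadata):
        # Deduplicate here so every caller benefits, whether it goes
        # through a wrapper function or calls this method directly.
        if "Tags" in metadata:
            metadata["Tags"] = normalize_tags(metadata["Tags"])
        self.metadata.update(metadata)
        return self


creator = SketchCreator().config_metadata(
    Title="TED", Tags=["ted", "_category:ted", "ted", "_videos:yes"]
)
print(creator.metadata["Tags"])  # ['ted', '_category:ted', '_videos:yes']
```

With this shape, a scraper that passes an already duplicated list gets a clean one back without any change on its side.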

dan-niles avatar Apr 20 '24 05:04 dan-niles

Yep, this makes sense. Good observations!

benoit74 avatar Apr 30 '24 12:04 benoit74

Strongly related to #164, should be implemented together

benoit74 avatar Jun 11 '24 11:06 benoit74