warc2zim

Add option to exclude some paths from front pages

Open benoit74 opened this issue 1 year ago • 2 comments

Currently, whether a ZIM item is marked is_front is decided purely from the item's mimetype: https://github.com/openzim/warc2zim/blob/5de5d0e0a284611ac376a328fd18b7ad7a9ad5aa/src/warc2zim/items.py#L58-L62

This has the drawback that we sometimes end up with unwanted front pages. A typical case is iframes, which are meant only to be embedded within a page.

I think this could easily be solved with an additional CLI parameter containing an is_front_exclude regex, matched against ZIM paths, for items that must not be marked is_front. I don't think having an is_front_include is necessary.
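A minimal sketch of how such an exclusion could work, not the actual warc2zim code: the function name `is_front_page`, its parameters, and the assumption that the current mimetype check amounts to "is this HTML?" are all hypothetical and only for illustration.

```python
import re
from typing import Optional


def is_front_page(
    path: str,
    mimetype: str,
    exclude: Optional[re.Pattern] = None,
) -> bool:
    """Decide whether an item should be marked as a front page.

    Hypothetical helper: the HTML mimetype test is an assumption about
    the existing behaviour, and `exclude` stands for the proposed
    is_front_exclude regex applied to the ZIM path.
    """
    # Assumed existing rule: only HTML items are front-page candidates.
    if not mimetype.startswith("text/html"):
        return False
    # Proposed rule: a path matching the exclusion regex is never a front page.
    if exclude is not None and exclude.search(path):
        return False
    return True


# Example: exclude everything under a (made-up) embed directory.
embed_only = re.compile(r"^example\.com/embed/")
print(is_front_page("example.com/index.html", "text/html", embed_only))
print(is_front_page("example.com/embed/player.html", "text/html", embed_only))
```

Using `search` rather than `fullmatch` lets the pattern match anywhere in the path, so the user anchors it with `^` or `$` only when needed.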

benoit74 • Aug 09 '24 09:08

Didn't we already have a similar issue where we discussed getting this in-iframe information from the crawler?

rgaudin • Aug 09 '24 10:08

Good point, we might even already have this information in the WARC. I don't remember exactly when or where we discussed this. Simply using that information would probably cover at least 80% of the need here, and in an automated way, which is far superior. To be investigated.

benoit74 • Aug 12 '24 07:08