
Automatically retrieve the assets referenced in `data-xxx` attributes

Open benoit74 opened this issue 8 months ago • 2 comments

Lots of web frameworks store custom data in `data-xxx` attributes, which are quite standard: https://www.w3schools.com/tags/att_global_data.asp

While these attributes are custom per application, they regularly contain URLs to assets that will be dynamically loaded.

For instance, on solar.lowtechmagazine.com, every picture in an article carries two `data-xxx` attributes (`data-dither` and `data-original`):

```html
<img
  alt="Image: Steel rebar construction for the concrete foundation of a wind turbine in Gilliam County, US. Image by Goose Chap, Wikimedia Commons (CC BY-SA 4.0)"
  data-dither="/2024/03/how-to-escape-from-the-iron-age/images/dithers/rebar-foundation-wind-turbine_dithered.png"
  data-original="https://solar.lowtechmagazine.com/2024/03/how-to-escape-from-the-iron-age/images/rebar-foundation-wind-turbine_hu441e9fb4cbb0b124bd4444c8de3f97e3_4611143_800x800_fit_q90_h2_box.webp"
  loading="lazy"
  src="https://solar.lowtechmagazine.com/2024/03/how-to-escape-from-the-iron-age/images/dithers/rebar-foundation-wind-turbine_dithered.png"
>
```

Currently, the crawler does not process these `data-xxx` attributes at all. When they are used in the replay phase, the corresponding assets are missing from the WARC / ZIM and the page ends up broken.

It would be valuable to automatically retrieve every asset referenced by a `data-xxx` attribute, whenever the attribute value looks like an absolute or relative URL.
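To make the "looks like a URL" idea concrete, here is a rough offline sketch in Python using only the standard library. It is not a Browsertrix behavior and does not use any Browsertrix API; the names `looks_like_url`, `DataAttrCollector` and `collect_data_urls` are hypothetical, and the heuristic itself (accept `http(s)` URLs and values starting with `/`, `./` or `../`) is just one possible assumption that would miss bare relative paths such as `images/foo.png`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def looks_like_url(value: str) -> bool:
    """Heuristic (an assumption, not an agreed rule): accept http(s) URLs
    and path-like values starting with "/", "./" or "../"."""
    value = value.strip()
    if not value or any(ch.isspace() for ch in value):
        return False
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    if parsed.scheme in ("http", "https"):
        return bool(parsed.netloc)
    if parsed.scheme:
        return False  # e.g. "data:", "javascript:", "mailto:"
    return value.startswith(("/", "./", "../"))


class DataAttrCollector(HTMLParser):
    """Collects URL-like values of data-* attributes, resolved against base_url."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("data-") and value and looks_like_url(value):
                absolute = urljoin(self.base_url, value.strip())
                if absolute not in self.urls:  # keep order, drop duplicates
                    self.urls.append(absolute)


def collect_data_urls(html: str, base_url: str) -> list[str]:
    """Return the absolute URLs found in data-* attributes of the given HTML."""
    parser = DataAttrCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

Fed the `<img>` element above with the article URL as `base_url`, this would return the resolved `data-dither` and `data-original` URLs, while values such as `data-count="3"` would be skipped. A real implementation would more likely run in the page as a crawler behavior and fetch each candidate URL so that it ends up in the WARC.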

This can probably be done with a behavior in Browsertrix crawler. The only question is whether this is implemented as a custom openZIM behavior or directly in Browsertrix crawler.

benoit74 avatar Jun 07 '24 14:06 benoit74