benoit74
PWA: pwa.kiwix.org, 3.3.2. ZIM: https://mirror.download.kiwix.org/zim/.hidden/dev/mes-quartiers-chinois_fr_all_2024-05.zim. Safari: 17.5 on macOS Sonoma 14.5 (both also observed by @Jaifroid on an iPhone 15 Pro Max with iOS 17 Safari). The YouTube video (see e.g....
For some reason, https://library.kiwix.org/viewer#edu.gcfglobal.org_en_all_2024-06 does not properly redirect to https://library.kiwix.org/viewer#edu.gcfglobal.org_en_all_2024-06/edu.gcfglobal.org/en/topics/. The viewer loads, but the iframe is then not redirected to the proper resource. However, https://library.kiwix.org/viewer#edu.gcfglobal.org_en_all_2024-06/ is...
ZIM: https://dev.library.kiwix.org/viewer#fas-military-medicine_en_2024-05/irp.fas.org/doddir/milmed/index.html. Chrome: 125. OS: macOS Sonoma 14.5. Clicking on Steve Aftergood at the bottom of the front page should open a `mailto:` link. This does not happen....
ZIM: mes-quartiers-chinois_fr_all_2024-05 on dev.library.kiwix.org. Scraper: warc2zim 2.0.0-dev8 + zimit 2.0.0-dev5 + Browsertrix crawler 1.1.3. Browser: Firefox 126.0 on macOS Sonoma 14.5. When clicking on a link with `target="_blank"`, this...
It would be nice if the crawler could automatically fetch rules from `robots.txt` and add `exclusion` rules for every rule present in the `robots.txt` file. I think this functionality should...
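A minimal sketch of the idea above, assuming that exclusions are expressed as regular expressions matched against full URLs. The function names `robots_to_exclusions` and `path_to_regex` are hypothetical, and `Allow` rules are deliberately ignored here, which a real implementation would need to handle.

```python
import re


def path_to_regex(origin: str, path: str) -> str:
    """Convert one robots.txt path pattern into a regex string.

    `*` matches any run of characters and a trailing `$` anchors the end.
    """
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    body = ".*".join(re.escape(part) for part in path.split("*"))
    return "^" + re.escape(origin) + body + ("$" if anchored else "")


def robots_to_exclusions(robots_txt: str, origin: str, agent: str = "*") -> list[str]:
    """Return one regex per Disallow rule that applies to `agent`."""
    exclusions = []
    agents = []       # user-agents of the group currently being read
    in_rules = False  # True once the current group has listed a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            # An empty Disallow means "nothing is disallowed", so skip it.
            if field == "disallow" and value and agent.lower() in agents:
                exclusions.append(path_to_regex(origin, value))
    return exclusions
```

Each returned string could then be passed to the crawler as one exclusion rule; how that is wired into the crawler is left open here.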
I'm trying to create a login profile for www.solidarite-numerique.fr in order to set cookies that disable the banners highlighted in green in the screenshot below. Banner 1...
Debian now requires the use of virtual environments so as not to interfere with dependencies installed by official apt packages. This commit also removes the tldextract update now that pywb is not...
Three things can stop the crawler in the middle of a run: - `--sizeLimit`: the maximum WARC size - `--timeLimit`: the maximum duration of the crawl -...
Many web frameworks store custom data in `data-xxx` attributes, which are standard: https://www.w3schools.com/tags/att_global_data.asp While these attributes are specific to each application, they regularly contain URLs to assets that will...
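To illustrate what detecting such URLs could look like, here is a rough sketch that collects URL-like values from `data-*` attributes using only the Python standard library. The class name `DataAttrUrlCollector` and the `looks_like_url` heuristic are assumptions made for illustration, not part of any existing scraper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def looks_like_url(value: str) -> bool:
    """Crude heuristic (an assumption): absolute, protocol-relative or root-relative."""
    return value.startswith(("http://", "https://", "//", "/"))


class DataAttrUrlCollector(HTMLParser):
    """Collect URL-like values found in data-* attributes of any tag."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names; value is None when absent.
        for name, value in attrs:
            if name.startswith("data-") and value and looks_like_url(value):
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


collector = DataAttrUrlCollector("https://example.org/page/")
collector.feed('<div data-src="/img/a.png" data-id="42"></div>')
print(collector.urls)
```

A heuristic like this will miss URLs embedded in JSON blobs or built at runtime, so it can only cover the simplest cases.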
Scraping a large website (millions of pages) is challenging because: - since the scrape takes a long time to complete, the chance that the website changes during the crawl is significant: - this can...