New request: iranwire.com
This is a subtask of https://github.com/openzim/zim-requests/issues/826, to track the progress of each recipe individually and avoid confusion.
- Website URL: https://iranwire.com/fa/
Recipe already created here: https://farm.openzim.org/recipes/iranwire.com_persian
Impacted by upstream issue for now: https://github.com/openzim/warc2zim/issues/188
Now impacted by https://github.com/openzim/warc2zim/issues/261
Issues mentioned above have been solved / are not occurring anymore.
The problem now is that we get blocked by Cloudflare after some time: it looks like all requests end up with 403 errors at some point. We are getting in contact with the iranwire.com team to find a solution (IP whitelisting, ...).
Some of our worker IPs (ondemand IPv4 and IPv6, athena18 IPv4 and IPv6 and pixelmemory IPv4) have been whitelisted by iranwire.com.
Crawl completed successfully and produced the WARC:
Conversion to ZIM failed due to a known bug in 2.0.1, which has since been fixed in 2.0.2.
What we now see is that:
- the crawl seems to be mostly complete, we do not see many resources missing. Once we filter out noise from twitter/facebook/addtoany with something like
^.*ZimPath\((?:t\.me|www\.facebook|www\.reddit|twitter\.com|api\.whatsapp|iranwire\.com\/login|iranwire\.com\/register|www\.addtoany).*$
only 10% of the log remains; and if we focus on iranwire.com, we have about 10k unique resources missing, most of them images, which is linked to the next item
- we miss a significant number of images; implementing https://github.com/openzim/zimit/issues/316 would probably solve the problem
- the videos are missing from the WARC, because the Youtube player is "hidden" behind a picture click event, i.e. it is dynamically added to the page when the user clicks the video; the autoplay behavior hence fails to find the video and does not trigger
All of this seems fixable with some engineering effort.
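As a rough illustration of the log filtering described above, here is a minimal sketch that drops the social/login noise and keeps the remaining lines. The regex is the one quoted above; the function name and the sample log lines are hypothetical, and the real log format may differ.

```javascript
// Regex quoted in the comment above: matches log lines about social/login resources.
const noisePattern =
  /^.*ZimPath\((?:t\.me|www\.facebook|www\.reddit|twitter\.com|api\.whatsapp|iranwire\.com\/login|iranwire\.com\/register|www\.addtoany).*$/;

// Hypothetical helper: keep only the lines that are NOT social/login noise.
function dropNoise(logLines) {
  return logLines.filter((line) => !noisePattern.test(line));
}

// Hypothetical sample lines, only meant to show the shape of the filtering.
const sample = [
  "missing ZimPath(twitter.com/share?url=x)",
  "missing ZimPath(iranwire.com/login)",
  "missing ZimPath(iranwire.com/fa/some-image.jpg)",
];
console.log(dropNoise(sample));
```

Counting the lines that remain after such a filter is one way to arrive at a figure like the "10% of the log" mentioned above.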
For the record, see https://kiwix.freshdesk.com/a/tickets/71198 for some details around IP whitelisting
After some investigation, it looks like I was wrong in my previous analysis of why images are missing. The autofetch behavior is supposed to grab them all, so I don't understand why the WARC is incomplete. I will restart the recipe with a low limit on the number of pages to fetch, just to confirm how it is working (or not).
I've also investigated the video issue. I managed to write a custom behavior that triggers playback of the Youtube video; however, it does not wait for the player to really start, and in the end there is no video in the WARC.
For reference, this is the custom behavior I used:
// custom behavior for iranwire.com website: automatically start the videos since Youtube player
// is not inside the DOM until play button is clicked.
class IranWireCom {
  static get id() {
    return "IranWireCom";
  }

  static isMatch() {
    const pathRegex = /https:\/\/iranwire\.com\//;
    return !!window.location.href.match(pathRegex);
  }

  static init() {
    return {
      state: { playbuttons: 0 },
    };
  }

  async* run(ctx) {
    const { xpathNodes, scrollAndClick, getState } = ctx.Lib;
    const playButtons = xpathNodes("//*[contains(@class,'video-component-play')]");
    for await (const playButton of playButtons) {
      scrollAndClick(playButton);
      yield getState(ctx, "Video play button", "playbuttons");
    }
    yield "IranWireCom Behavior Complete";
  }
}
When placed inside a custom-behaviors subfolder, it is activated simply by passing -v $PWD/custom-behaviors:/custom-behaviors to the docker command and --customBehaviors /custom-behaviors/ to the crawler, not forgetting to enable the siteSpecific behavior with --behaviors "siteSpecific,...".
I tried to change the order of behaviors to check if it might have an impact but without success.
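Regarding the behavior not waiting for the player to really start: one direction that might be worth trying is to poll for the player after each click before moving on. The sketch below is untested against iranwire.com; `waitFor` is a hypothetical helper (not part of the crawler's library), and the iframe selector shown in the trailing comment is an assumption about how the Youtube player gets inserted.

```javascript
// Hypothetical helper: call `predicate` repeatedly until it returns a truthy
// value, or give up and return null once `timeoutMs` has elapsed.
async function waitFor(predicate, timeoutMs = 10000, intervalMs = 250) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = predicate();
    if (result) {
      return result;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return null;
}

// Possible usage inside run(), after clicking a play button (assumed selector):
//   await waitFor(() => document.querySelector("iframe[src*='youtube']"));
```

Even if the iframe shows up, this does not guarantee that the video stream itself ends up in the WARC; it only removes one possible race between the click and the end of the behavior.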
@Popolechien @kelson42 would it make any sense to create a ZIM without videos, at least until the issue around videos is solved?
Yes, as a temporary solution.
OK, so I finally found the problem with the images: for some reason, JS code is adding an inline visibility: hidden style to the first image of every article. I'm struggling to find the JS responsible for this, so for now we will live with a CSS trick/hack to restore the original visibility.
I've created custom CSS to work around this bug and to get rid of ads, social links and search boxes.
I've also reconfigured the recipe to include a few useful pages which are not under the /fa prefix but are still in Farsi as far as I can tell (authors, petitions and questions).
The last task execution, limited to a few pages (100), created a ZIM which seems OK. I will relaunch the recipe on all pages and let's see what comes out in more or less 1 week.
The last task did not produce a full ZIM at all because we now need to add the base URL to the include regex.
Petitions pages are not displaying any image due to https://github.com/openzim/zimit/issues/316
So I've modified the include regex to not include them for now: iranwire.com(?:$|\/$|\/author\/|\/petition\/|\/questions\/|\/fa\/). I've relaunched the task, first with depth set to 1, to confirm that the include regex and the custom CSS are OK.
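To sanity-check an include regex like the one quoted above before launching a long crawl, it can be run against a few URLs. This is a minimal sketch assuming the crawler applies the regex as an unanchored search on the full URL; the helper name is hypothetical and the sample URLs are only illustrative.

```javascript
// Include regex exactly as quoted in the comment above.
const includeRegex = /iranwire.com(?:$|\/$|\/author\/|\/petition\/|\/questions\/|\/fa\/)/;

// Hypothetical helper: would this URL be kept in scope by the include regex?
function isIncluded(url) {
  return includeRegex.test(url);
}

console.log(isIncluded("https://iranwire.com")); // base URL without trailing slash
console.log(isIncluded("https://iranwire.com/")); // base URL with trailing slash
console.log(isIncluded("https://iranwire.com/fa/features/38626/")); // Farsi article
console.log(isIncluded("https://iranwire.com/en/")); // outside the listed prefixes
```

The first two alternatives (`$` and `\/$`) are what bring the base URL back into scope; the others are path prefixes.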
The new recipe configuration is now scraping tons of "stupid" pages like https://info@iranwire.com/fa/features/38626/, i.e. with an info user in the URL.
I've canceled the recipe, modified the include setting to exclude these pages and re-requested the recipe.
The link you gave looks like a regular entry - what is "stupid" about it?
Thank you for asking! The "stupid" thing is that the URL is on info@iranwire.com instead of iranwire.com. This looks like a bug on their side on a random page, which suddenly duplicates all entries to fetch and store in the ZIM (once for info@iranwire.com and once for iranwire.com).
What I said is not totally correct. This is stupid because the info@ part is dropped by warc2zim anyway, so in the end we will store only one entry, and all links with info@ will be rewritten without it. So we are "only" losing time fetching too many pages (but this is already a lot of time lost).
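To illustrate why the two URL forms end up as a single entry: the info@ part is only the userinfo component of the URL, so host and path are identical once it is removed. The sketch below uses the standard URL API and is not warc2zim's actual code; the function name is hypothetical.

```javascript
// Hypothetical helper: remove the userinfo ("user:password@") part of a URL.
function stripUserInfo(rawUrl) {
  const url = new URL(rawUrl);
  url.username = "";
  url.password = "";
  return url.href;
}

const withUser = new URL("https://info@iranwire.com/fa/features/38626/");
console.log(withUser.username); // the userinfo part
console.log(withUser.hostname); // the actual host
console.log(stripUserInfo(withUser.href));
```

The crawler, however, compares the raw URLs, which is why every page gets fetched twice.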
It looks like last ZIM is ready for review: https://dev.library.kiwix.org/content/iranwire-com_far_all_2024-07/
What is known to not work:
- videos and audio: most have been made inaccessible, but a few remain; I consider it not feasible to make them work within the current timeframe / budget
- questions like the ones on iranwire.com/questions/legal/ are not working => upstream bug is https://github.com/openzim/warc2zim/issues/363 and will be solved quickly
@Popolechien please review the ZIM to identify whatever needs to be fixed before communicating the ZIM to the client
Moved to prod: since we have not received any further feedback for weeks, it is assumed to be OK.
ZIM is ready at https://library.kiwix.org/viewer#iranwire-com_far_all or https://download.kiwix.org/zim/zimit/iranwire-com_far_all_2024-09.zim