New request: ir.voanews.com
This is a subtask of https://github.com/openzim/zim-requests/issues/826, used to track recipe progress one by one and avoid confusion.
- Website URL: https://ir.voanews.com/
Recipe already created here: https://farm.openzim.org/recipes/ir.voanews.com_persian
Task failed: https://farm.openzim.org/pipeline/4889a582-f24d-4364-acad-507c5d94ced6/debug
Cause is https://github.com/openzim/zimit/issues/266
I'm not restarting the recipe; it is clear that something needs to be changed upstream first.
Last WARC seems to be mostly OK at https://tmp.kiwix.org/ci/test-warc/ir.voanews.com_persian_2024-04-19/d4cbebe4-c8d3-4729-a083-bf5801beab92_zimit.tar
Conversion to ZIM is completing, but:
- Videos are not working
- Big images are not always working since they depend on screen resolution (i.e. they do not display on desktop or tablet but are OK on phones)
Sample WARC content (built with the crawler with --mobileDevice Pixel2) for https://ir.voanews.com/a/un-afghanistan-taliban-doha-meeting-women-rights/7681308.html (only content from gdb.voanews.com, which seems to be the image CDN, is listed, and URLs are sorted alphabetically for convenience):
https://gdb.voanews.com/01000000-0aff-0242-5a3d-08dc93d604bf_tv_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-5a3d-08dc93d604bf_tv_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w144_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w33_r1.jpg
https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png
https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w408_r1_s.png
https://gdb.voanews.com/26a00de1-2f39-4630-a996-147d0df3e447_w144_r1.jpg
https://gdb.voanews.com/26a00de1-2f39-4630-a996-147d0df3e447_w33_r1.jpg
https://gdb.voanews.com/4315c626-5239-4be0-956b-62af07c4aea1_w144_r1.jpg
https://gdb.voanews.com/4315c626-5239-4be0-956b-62af07c4aea1_w33_r1.jpg
https://gdb.voanews.com/59b57984-6b9c-4fc1-af95-4dff47d441df_w144_r1.jpg
https://gdb.voanews.com/59b57984-6b9c-4fc1-af95-4dff47d441df_w33_r1.jpg
https://gdb.voanews.com/5add67da-1e19-48f6-98ca-3cd8d0bbbe01_w144_r1.jpg
https://gdb.voanews.com/5add67da-1e19-48f6-98ca-3cd8d0bbbe01_w33_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w144_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w33_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w144_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w33_r1.jpg
https://gdb.voanews.com/d651d75e-bae4-4045-905f-c32fc9bff412_w144_r1.jpg
https://gdb.voanews.com/d651d75e-bae4-4045-905f-c32fc9bff412_w33_r1.jpg
And for https://ir.voanews.com/a/iran-elections-opposition-dissidents-figures-boycott-call/7681344.html:
https://gdb.voanews.com/01000000-0a00-0242-bf4f-08dc99a71c65_tv_w144_r1.jpg
https://gdb.voanews.com/01000000-0a00-0242-bf4f-08dc99a71c65_tv_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-7360-08db49c8d311_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-7360-08db49c8d311_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-a286-08dc991b6a34_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-a286-08dc991b6a34_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w408_r1_s.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w144_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w33_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_w144_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_w33_r1.jpg
https://gdb.voanews.com/89e0a08f-6e2b-407f-8b24-44c9d6a6f4e1_w144_r1.jpg
https://gdb.voanews.com/89e0a08f-6e2b-407f-8b24-44c9d6a6f4e1_w33_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w144_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w33_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w144_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w33_r1.jpg
What we see is that:
- the system seems to be automatically fetching multiple resolutions of the same image
- most images are present only in w144 and w33 sizes
- the main images of the pages (10050000-0aff-0242-f2d3-08da53b2fd74 and 01000000-0aff-0242-ce72-08dc9778f46b) are also present in w250 and w408
When opening the same pages on desktop it tries to load a different resolution (and adds a _s suffix to the URL):
- https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1023_r1_s.png for first article
- https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg for second article
When clicking the "high res" button available on the page, it loads a different resolution and the name pattern is slightly different:
- https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1597_n_r1_st_s.png for first article
- https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1597_n_r1_st_s.jpg for second article
I did not find cases where multiple "big" images were present on a single article; there was always only one single "big" image.
The upstream server is in fact resizing images on demand: you can request any resolution, nothing is pre-computed. The _n, _st and _r1 parts are flags used to enable or disable some watermarks / overlays (e.g. _st activates a subtitle in the upper right corner). The _cx0_cy6_cw0 part is used to change the center of the image (probably cropping a few pixels).
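To make the naming scheme described above easier to reason about, here is a minimal sketch that splits one of these image URLs into its parts. The function name `parseImageUrl` and the returned field names are hypothetical, and the split only reflects what was observed on the sample URLs in this issue, not any documented upstream format.

```javascript
// Sketch only: splits a gdb.voanews.com image URL into the parts observed in
// this issue. The function and field names are hypothetical.
function parseImageUrl(url) {
  const { hostname, pathname } = new URL(url);
  const file = pathname.slice(1); // drop the leading "/"
  const dot = file.lastIndexOf('.');
  const ext = file.slice(dot + 1);
  // The image identifier contains no underscore; every other part is
  // appended with "_" (e.g. _tv, _cx0, _w144, _r1, _s).
  const [id, ...tokens] = file.slice(0, dot).split('_');
  let width = null;
  const flags = [];
  for (const token of tokens) {
    const match = /^w(\d+)$/.exec(token);
    if (match) {
      width = Number(match[1]); // requested width in pixels
    } else {
      flags.push(token); // overlay / crop flags such as r1, st, cx0
    }
  }
  return { host: hostname, id, width, flags, ext };
}

const example = parseImageUrl(
  'https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1597_n_r1_st_s.jpg',
);
console.log(example);
```

Treating anything that is not the width as a flag keeps the sketch tolerant of parts whose meaning is not known yet (e.g. _tv or _s).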
It is also important to note that creating fuzzy rules for this is made more complicated by the fact that the lowest resolutions are fetched first, so a fuzzy rule covering all resolutions would technically work but would store only the crappy 33-pixel images in the ZIM, and the website would be significantly degraded.
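The degradation described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the catch-all pattern, the `_any` suffix and the helper names are hypothetical, and it assumes that when several records share the same fuzzy key, only the first one stored is kept.

```javascript
// Hypothetical catch-all rule: every width of an image maps to the same key.
const CATCH_ALL = /^gdb\.voanews\.com\/(.*)_w\d+.*(\.\w+)$/;

function fuzzyKey(url) {
  return url.replace(CATCH_ALL, 'gdb.voanews.com.fuzzy.replayweb.page/$1_any$2');
}

// Assumption: the first record stored under a given key wins.
function buildStore(crawlOrder) {
  const store = new Map();
  for (const url of crawlOrder) {
    const key = fuzzyKey(url);
    if (!store.has(key)) {
      store.set(key, url);
    }
  }
  return store;
}

// Widths seen in the WARC for the second article, lowest resolution first.
const crawled = [
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg',
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg',
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg',
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w408_r1_s.jpg',
];
const store = buildStore(crawled);

// URL requested by a desktop browser at replay time.
const desktopRequest =
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg';

// Resolves to the w33 thumbnail, because it was the first one stored.
console.log(store.get(fuzzyKey(desktopRequest)));
```

Under these assumptions the desktop request does resolve, but to the smallest capture, which is why a single catch-all rule is not enough.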
So the conclusion is that:
- the system is behaving as instructed: when we open such a WARC/ZIM on a screen in the same size range as the Pixel2, it displays the images properly
- I don't know how we can fix this problem with fuzzy rules in static / dynamic rewriting
Do we need to wait for https://github.com/openzim/warc2zim/issues/271? (I'm not even sure how exactly this could solve the issue.)
I've identified fuzzy rules which might indeed work:
- pattern: gdb.voanews.com/(.*_w33_.*)
  replace: gdb.voanews.com.fuzzy.replayweb.page/\1
- pattern: gdb.voanews.com/(.*_w144_.*)
  replace: gdb.voanews.com.fuzzy.replayweb.page/\1
- pattern: gdb.voanews.com/(.*_w250_.*)
  replace: gdb.voanews.com.fuzzy.replayweb.page/\1
- pattern: gdb.voanews.com/(.*)_w.*(\..*?)
  replace: gdb.voanews.com.fuzzy.replayweb.page/\1_high\2
Associated JS tests:
test('gdb.voanews.com_1', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1023_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_high.png',
  );
});
test('gdb.voanews.com_2', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1597_n_r1_st_s.jpg'),
    'gdb.voanews.com.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_high.jpg',
  );
});
test('gdb.voanews.com_3', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w33_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w33_r1_s.png',
  );
});
test('gdb.voanews.com_4', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w144_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w144_r1_s.png',
  );
});
test('gdb.voanews.com_5', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png',
  );
});
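For reference, the intended semantics of the rules can be sketched as a standalone function. This is not the warc2zim implementation: `applyRulesSketch` is a hypothetical name, and the sketch assumes that rules are tried in order, that the first matching pattern wins, and that the substitution behaves like `String.prototype.replace` (with `\1` written as `$1` in JavaScript).

```javascript
// Same patterns as the rules above, in the same order; \1 becomes $1 in JS.
const RULES = [
  { pattern: 'gdb.voanews.com/(.*_w33_.*)', replace: 'gdb.voanews.com.fuzzy.replayweb.page/$1' },
  { pattern: 'gdb.voanews.com/(.*_w144_.*)', replace: 'gdb.voanews.com.fuzzy.replayweb.page/$1' },
  { pattern: 'gdb.voanews.com/(.*_w250_.*)', replace: 'gdb.voanews.com.fuzzy.replayweb.page/$1' },
  { pattern: 'gdb.voanews.com/(.*)_w.*(\\..*?)', replace: 'gdb.voanews.com.fuzzy.replayweb.page/$1_high$2' },
];

// Assumption: first matching rule wins, later rules are ignored.
function applyRulesSketch(url) {
  for (const { pattern, replace } of RULES) {
    const regex = new RegExp(pattern);
    if (regex.test(url)) {
      return url.replace(regex, replace);
    }
  }
  return url; // no rule matched: URL is left untouched
}

// Small sizes keep their own entry; any other size collapses to "_high".
console.log(applyRulesSketch('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w33_r1_s.png'));
console.log(applyRulesSketch('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1023_r1_s.png'));
```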
Unfortunately, they do not work due to another limitation in warc2zim (I'll open a ticket right now)
Marking as done in zimit2 project since we are not going to complete this task as part of the project