New request: ir.voanews.com

Open benoit74 opened this issue 1 year ago • 5 comments

This is a subtask of https://github.com/openzim/zim-requests/issues/826, opened to track recipe progress one by one and avoid confusion.

  • Website URL: https://ir.voanews.com/

Recipe already created here: https://farm.openzim.org/recipes/ir.voanews.com_persian

benoit74 avatar Feb 19 '24 10:02 benoit74

Task failed: https://farm.openzim.org/pipeline/4889a582-f24d-4364-acad-507c5d94ced6/debug

Cause is https://github.com/openzim/zimit/issues/266

I'm not restarting the recipe; it is clear that something needs to be changed upstream first.

benoit74 avatar Feb 22 '24 12:02 benoit74

The latest WARC seems to be mostly OK: https://tmp.kiwix.org/ci/test-warc/ir.voanews.com_persian_2024-04-19/d4cbebe4-c8d3-4729-a083-bf5801beab92_zimit.tar

Conversion to ZIM is completing, but:

  • Videos are not working
  • Big images are not always working since they depend on screen resolution (i.e. they do not display on desktop or tablet but are OK on phones)

benoit74 avatar Jun 03 '24 13:06 benoit74

Sample WARC content (built with the crawler with --mobileDevice Pixel2) for https://ir.voanews.com/a/un-afghanistan-taliban-doha-meeting-women-rights/7681308.html (only content from gdb.voanews.com, which seems to be the image CDN, is listed, and URLs are sorted alphabetically for convenience):

https://gdb.voanews.com/01000000-0aff-0242-5a3d-08dc93d604bf_tv_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-5a3d-08dc93d604bf_tv_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w144_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w33_r1.jpg
https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png
https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w408_r1_s.png
https://gdb.voanews.com/26a00de1-2f39-4630-a996-147d0df3e447_w144_r1.jpg
https://gdb.voanews.com/26a00de1-2f39-4630-a996-147d0df3e447_w33_r1.jpg
https://gdb.voanews.com/4315c626-5239-4be0-956b-62af07c4aea1_w144_r1.jpg
https://gdb.voanews.com/4315c626-5239-4be0-956b-62af07c4aea1_w33_r1.jpg
https://gdb.voanews.com/59b57984-6b9c-4fc1-af95-4dff47d441df_w144_r1.jpg
https://gdb.voanews.com/59b57984-6b9c-4fc1-af95-4dff47d441df_w33_r1.jpg
https://gdb.voanews.com/5add67da-1e19-48f6-98ca-3cd8d0bbbe01_w144_r1.jpg
https://gdb.voanews.com/5add67da-1e19-48f6-98ca-3cd8d0bbbe01_w33_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w144_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w33_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w144_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w33_r1.jpg
https://gdb.voanews.com/d651d75e-bae4-4045-905f-c32fc9bff412_w144_r1.jpg
https://gdb.voanews.com/d651d75e-bae4-4045-905f-c32fc9bff412_w33_r1.jpg

And for https://ir.voanews.com/a/iran-elections-opposition-dissidents-figures-boycott-call/7681344.html:

https://gdb.voanews.com/01000000-0a00-0242-bf4f-08dc99a71c65_tv_w144_r1.jpg
https://gdb.voanews.com/01000000-0a00-0242-bf4f-08dc99a71c65_tv_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-7360-08db49c8d311_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-7360-08db49c8d311_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-a286-08dc991b6a34_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-a286-08dc991b6a34_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w408_r1_s.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w144_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w33_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_w144_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_w33_r1.jpg
https://gdb.voanews.com/89e0a08f-6e2b-407f-8b24-44c9d6a6f4e1_w144_r1.jpg
https://gdb.voanews.com/89e0a08f-6e2b-407f-8b24-44c9d6a6f4e1_w33_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w144_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w33_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w144_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w33_r1.jpg

What we see is that:

  • the system seems to be automatically fetching multiple resolutions of the same image
  • most images are present only in w144 and w33 sizes
  • the main image of each page (10050000-0aff-0242-f2d3-08da53b2fd74 and 01000000-0aff-0242-ce72-08dc9778f46b respectively) is also present in w250 and w408
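
These observations can be checked mechanically. The snippet below is a minimal sketch, not part of any openzim tool: `widthsByImage` is a hypothetical helper that groups URLs by everything preceding the `_w<width>` marker (so the image ID plus any flag placed before the width) and lists the widths found for each group. The sample URLs are taken from the second listing above.

```javascript
// Hypothetical helper: group image URLs by everything that precedes the
// "_w<width>" marker and collect the widths seen for each group.
function widthsByImage(urls) {
  const groups = new Map();
  for (const url of urls) {
    const name = url.split('/').pop();
    const match = name.match(/^(.*?)_w(\d+)_/);
    if (!match) continue; // not a resizable-image URL
    const [, key, width] = match;
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key).add(Number(width));
  }
  // Turn each Set into a numerically sorted array.
  return new Map(
    [...groups].map(([key, widths]) => [key, [...widths].sort((a, b) => a - b)]),
  );
}

const sample = [
  'https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg',
  'https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg',
  'https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg',
  'https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w408_r1_s.jpg',
  'https://gdb.voanews.com/89e0a08f-6e2b-407f-8b24-44c9d6a6f4e1_w144_r1.jpg',
  'https://gdb.voanews.com/89e0a08f-6e2b-407f-8b24-44c9d6a6f4e1_w33_r1.jpg',
];
const widths = widthsByImage(sample);
console.log(widths);
```

Note that a cropped variant (with a `_cx0_cy6_cw0` part before the width) ends up in its own group with this grouping key.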

When opening the same pages on desktop, the site tries to load a different resolution (and adds a _s suffix to the URL):

  • https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1023_r1_s.png for first article
  • https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg for second article

When clicking the "high res" button available on the page, it loads yet another resolution, and the name pattern is slightly different:

  • https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1597_n_r1_st_s.png for first article
  • https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1597_n_r1_st_s.jpg for second article

I did not find any case where multiple "big" images were present on a single article; there was always exactly one "big" image.

The upstream server is in fact resizing images on demand: you can request any resolution, nothing is precomputed. The _n, _st and _r1 parts are flags used to enable / disable some watermarks / overlays (e.g. _st activates a subtitle in the upper right corner). The _cx0_cy6_cw0 part is used to change the center of the image (probably cropping a few pixels).
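
To make the naming scheme concrete, here is a rough sketch of a parser for these file names. `parseImageName` and the field names are hypothetical, chosen for illustration only; the sketch merely splits the name and does not interpret what individual flags mean.

```javascript
// Hypothetical parser for a gdb.voanews.com image file name.
// "prefix" is the image ID plus anything that precedes the width (such as
// the _cx/_cy/_cw part); "flags" are the tokens following the width.
function parseImageName(name) {
  const match = name.match(/^(.*?)_w(\d+)((?:_[a-z0-9]+)*)\.([a-z0-9]+)$/i);
  if (!match) return null; // name does not follow the observed pattern
  const [, prefix, width, rawFlags, extension] = match;
  return {
    prefix,
    width: Number(width),
    flags: rawFlags.split('_').filter(Boolean),
    extension,
  };
}

const highRes = parseImageName('10050000-0aff-0242-f2d3-08da53b2fd74_w1597_n_r1_st_s.png');
const cropped = parseImageName('AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w33_r1.jpg');
console.log(highRes, cropped);
```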

It is also important to note that creating fuzzy rules for this is complicated by the fact that the lowest resolutions are fetched first: a fuzzy rule covering all resolutions would technically work, but only the crappy 33-pixel images would be stored in the ZIM and the website would be significantly degraded.
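
The effect can be illustrated with a small simulation. This is only a sketch under two assumptions: that when several URLs rewrite to the same fuzzy key the first stored entry wins, and that the fetch order below (lowest resolution first) is representative. `storeFirstWins` and `catchAll` are hypothetical names, not warc2zim code.

```javascript
// Assumption: on a key collision, the entry stored first is kept.
function storeFirstWins(fetchOrder, rewrite) {
  const stored = new Map();
  for (const url of fetchOrder) {
    const key = rewrite(url);
    if (!stored.has(key)) stored.set(key, url);
  }
  return stored;
}

// Hypothetical catch-all rule: drop everything from "_w<width>" up to the
// file extension, so that all resolutions share a single key.
const catchAll = (url) => url.replace(/_w\d+.*(\.[a-z]+)$/, '$1');

// Illustrative fetch order, lowest resolution first.
const fetchOrder = [
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg',
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg',
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg',
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w408_r1_s.jpg',
];

const stored = storeFirstWins(fetchOrder, catchAll);
console.log(stored);
```

Under these assumptions only one entry survives, and it is the w33 variant.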

So the conclusion is that:

  • the system is behaving as instructed: when we open such a WARC/ZIM on a screen in the same size range as the Pixel2, it displays the images properly
  • I don't know how we can fix this problem with fuzzy rules in static / dynamic rewriting

Do we need to wait for https://github.com/openzim/warc2zim/issues/271? (I am not even sure how exactly this could solve the issue.)

benoit74 avatar Jul 02 '24 13:07 benoit74

I've identified fuzzy rules which might work indeed:

  - pattern: gdb.voanews.com/(.*_w33_.*)
    replace: gdb.voanews.com.fuzzy.replayweb.page/\1
  - pattern: gdb.voanews.com/(.*_w144_.*)
    replace: gdb.voanews.com.fuzzy.replayweb.page/\1
  - pattern: gdb.voanews.com/(.*_w250_.*)
    replace: gdb.voanews.com.fuzzy.replayweb.page/\1
  - pattern: gdb.voanews.com/(.*)_w.*(\..*?)
    replace: gdb.voanews.com.fuzzy.replayweb.page/\1_high\2
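
For readers who want to try these rules outside the project, here is a standalone stand-in for `applyFuzzyRules`. It is a sketch, not warc2zim's own implementation: it assumes the first matching rule wins, that only the matched part of the path is substituted, and that `\1`-style backreferences map to JavaScript's `$1`.

```javascript
// The four rules above, transcribed as JavaScript strings.
const rules = [
  { pattern: 'gdb.voanews.com/(.*_w33_.*)', replace: 'gdb.voanews.com.fuzzy.replayweb.page/\\1' },
  { pattern: 'gdb.voanews.com/(.*_w144_.*)', replace: 'gdb.voanews.com.fuzzy.replayweb.page/\\1' },
  { pattern: 'gdb.voanews.com/(.*_w250_.*)', replace: 'gdb.voanews.com.fuzzy.replayweb.page/\\1' },
  { pattern: 'gdb.voanews.com/(.*)_w.*(\\..*?)', replace: 'gdb.voanews.com.fuzzy.replayweb.page/\\1_high\\2' },
];

// Assumption: rules are tried in order and the first match is applied.
function applyFuzzyRules(path) {
  for (const rule of rules) {
    const regex = new RegExp(rule.pattern);
    if (regex.test(path)) {
      // Convert "\1" style backreferences to JavaScript's "$1" style.
      const replacement = rule.replace.replace(/\\(\d)/g, '$$$1');
      return path.replace(regex, replacement);
    }
  }
  return path; // no rule matched
}

console.log(applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1023_r1_s.png'));
```

With this stand-in, the w33, w144 and w250 variants keep their own entries, while any other width collapses onto a single `_high` key.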

Associated JS tests:


test('gdb.voanews.com_1', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1023_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_high.png',
  );
});

test('gdb.voanews.com_2', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1597_n_r1_st_s.jpg'),
    'gdb.voanews.com.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_high.jpg',
  );
});

test('gdb.voanews.com_3', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w33_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w33_r1_s.png',
  );
});

test('gdb.voanews.com_4', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w144_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w144_r1_s.png',
  );
});

test('gdb.voanews.com_5', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png',
  );
});

Unfortunately, these rules do not work in practice because of another limitation in warc2zim (I'll open a ticket right now).

benoit74 avatar Jul 02 '24 15:07 benoit74

Marking as done in the zimit2 project since we are not going to complete this task as part of that project.

benoit74 avatar Sep 10 '24 09:09 benoit74