
Integrate zimfarm dev setup

elfkuzco opened this issue 2 months ago · 28 comments

Rationale

As part of cleanup in the zimfarm API (openzim/zimfarm#1391), requests to create recipes/tasks now require an offliner definition version. This PR sets the offliner definition version from an env variable and sets up the Zimfarm containers in a docker-compose graph. Previously, the API used "initial" as the definition version, but as scrapers evolve and their arguments change, the definitions change too.

Changes

  • use mwoffliner definition version from env (default to image tag)
  • set up a compose graph that includes the Zimfarm containers. These are created with the profiles zimfarm and zimfarm-worker. The former starts up only the API and UI, while the latter starts up the worker and receiver in addition.
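The profile split described above can be sketched roughly as follows. This is a hypothetical excerpt of docker-compose-dev.yml, not the actual file: the per-service profile assignments and options are assumptions; only the service names, images, ports and the key mount come from this thread.

```yaml
# Hypothetical sketch of the profile layout, not the real compose file.
services:
  zimfarm-api:
    image: ghcr.io/openzim/zimfarm-backend:latest
    profiles: ["zimfarm"]            # started by --profile zimfarm
    ports:
      - "127.0.0.1:8004:80"
  zimfarm-ui:
    image: ghcr.io/openzim/zimfarm-ui:latest
    profiles: ["zimfarm"]
    ports:
      - "127.0.0.1:8003:80"
  zimfarm-worker-manager:
    image: ghcr.io/openzim/zimfarm-worker-manager:latest
    profiles: ["zimfarm-worker"]     # only started when --profile zimfarm-worker is also given
    volumes:
      - ./docker/zimfarm/id_ed25519:/etc/ssh/keys/zimfarm
```

With this layout, `docker compose --profile zimfarm up` brings up only the API and UI, while adding `--profile zimfarm-worker` also starts the worker services.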

elfkuzco avatar Oct 20 '25 14:10 elfkuzco

Codecov Report

:x: Patch coverage is 88.63636% with 5 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 92.87%. Comparing base (63b7a74) to head (544ecb6).

Files with missing lines Patch % Lines
wp1/logic/builder.py 88.00% 3 Missing :warning:
wp1/zimfarm.py 88.88% 2 Missing :warning:

:x: Your patch check has failed because the patch coverage (88.63%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1027      +/-   ##
==========================================
- Coverage   92.90%   92.87%   -0.04%     
==========================================
  Files          73       73              
  Lines        4229     4238       +9     
==========================================
+ Hits         3929     3936       +7     
- Misses        300      302       +2     

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Oct 20 '25 18:10 codecov[bot]

We should really run this from end-to-end to ensure this setup works correctly.

Yes I can definitely help with that. I'll patch this PR and try setting up/running the zimfarm locally and confirm that I can create and download ZIMs.

audiodude avatar Oct 28 '25 14:10 audiodude

Updated the files with the recent changes:

  • added separate buckets for artifacts, logs and zims
  • updated the README to detail the worker resources and reason for the offliner definition
  • updated worker resources to 3 CPU, 20G RAM, 20G disk

elfkuzco avatar Oct 30 '25 05:10 elfkuzco

Code LGTM; waiting for the e2e test from @audiodude (if I got it correctly) before giving my formal approval

benoit74 avatar Oct 30 '25 07:10 benoit74

I made some minor tweaks to the PR, but it's still not working. My Zimfarm is still reporting the following for requests to http://localhost:8004/v2/schedules:

{"success":false,"message":"Offliner definition for offliner mwoffliner with version 1.17.2 does not exist"}

EDIT: This is after following the directions in the README and updating my local credentials.py

audiodude avatar Oct 30 '25 17:10 audiodude

Hum, this is indeed a problem. To unblock you, please set 'definition_version': 'dev' in your local credentials.py; it should do the trick.
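In credentials.py terms, the workaround might look like the sketch below. The surrounding dict structure and key names are assumptions; only the 'definition_version': 'dev' entry comes from the suggestion above.

```python
# Hypothetical sketch of the relevant part of a local credentials.py.
# Only 'definition_version': 'dev' is from the thread; the rest is illustrative.
CREDENTIALS = {
    'ZIMFARM': {
        'user': 'test_user',              # hypothetical, keep your existing values
        'definition_version': 'dev',      # workaround: use the 'dev' definition
    },
}
```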

However, that is not the proper way to solve this situation before merging this PR. We will continually have new offliner definitions arriving, and all of them should be stored in the local Zimfarm DB so that a dev can use almost any mwoffliner version / definition version. I feel like docker/zimfarm/create_offliners.sh should fetch all existing definitions from api.farm.openzim.org and populate the ones missing from the local dev DB. The documentation would then state that developers should rerun this script regularly to fetch new offliner definitions if they want to use them in their credentials.py.
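The sync logic suggested above can be sketched independently of the real API contract. Everything here is an assumption: the function names, the (offliner, version) pair shape, and the callbacks; the real script would back these with calls to api.farm.openzim.org and the local Zimfarm DB.

```python
# Hypothetical sketch of the "fetch remote definitions, insert missing ones
# locally" loop. The callables and data shapes are illustrative, not the
# actual Zimfarm API.
def sync_definitions(fetch_remote, fetch_local, insert_local):
    """fetch_* return iterables of (offliner, version) pairs;
    insert_local(offliner, version) stores one definition locally."""
    remote = set(fetch_remote())
    local = set(fetch_local())
    missing = sorted(remote - local)
    for offliner, version in missing:
        insert_local(offliner, version)
    return missing
```

Rerunning the script would then be idempotent: only definitions not yet in the local DB get inserted.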

benoit74 avatar Oct 30 '25 19:10 benoit74

Okay, with the workaround I can successfully create schedules and schedule tasks.

However my tasks seem stuck:

(screenshot: task list showing stuck tasks)

It looks like I have a worker, but it was "last seen 12 minutes ago"?

(screenshot: worker last seen 12 minutes ago)

audiodude avatar Nov 03 '25 05:11 audiodude

However my tasks seem stuck:

That's probably because the worker doesn't have enough resources to run the task

elfkuzco avatar Nov 03 '25 09:11 elfkuzco

@elfkuzco can you try to reproduce @audiodude's issue and confirm it can be solved by giving the worker more resources? I don't get what is missing; the resources seem sufficient.

benoit74 avatar Nov 03 '25 09:11 benoit74

I've pulled the latest zimfarm images and my jobs are still stuck.

(screenshots: tasks still stuck)

audiodude avatar Nov 03 '25 19:11 audiodude

I've pulled the latest zimfarm images and my jobs are still stuck.

can i see your worker logs?

elfkuzco avatar Nov 03 '25 20:11 elfkuzco

can i see your worker logs?

Where do I find those?

audiodude avatar Nov 03 '25 20:11 audiodude

maybe docker logs -f <worker-container> in a different shell

elfkuzco avatar Nov 03 '25 20:11 elfkuzco

which one is the worker container? I have:

tmoney@tmoney-linux:~/code/wp1/wp1-frontend$ docker ps
CONTAINER ID   IMAGE                                    COMMAND                  CREATED          STATUS                    PORTS                                                             NAMES
4e88bcacd96f   ghcr.io/openzim/zimfarm-ui:latest        "/docker-entrypoint.…"   30 minutes ago   Up 29 minutes             127.0.0.1:8003->80/tcp                                            zimfarm-ui
21311c6dd01e   wp1-dev-dev-workers                      "/bin/sh -c 'supervi…"   30 minutes ago   Up 30 minutes                                                                               wp1bot-workers-dev
702158e27b52   ghcr.io/openzim/zimfarm-backend:latest   "uvicorn zimfarm_bac…"   30 minutes ago   Up 30 minutes (healthy)   127.0.0.1:8004->80/tcp                                            zimfarm-api
08aeb6538488   postgres:17.3-bookworm                   "docker-entrypoint.s…"   30 minutes ago   Up 30 minutes (healthy)   127.0.0.1:2345->5432/tcp                                          zimfarm-postgresdb
af5573efe297   wp1-dev-dev-database                     "docker-entrypoint.s…"   30 minutes ago   Up 30 minutes             0.0.0.0:6300->3306/tcp, [::]:6300->3306/tcp                       wp1bot-db-dev
a11a4bd190ef   redis                                    "docker-entrypoint.s…"   30 minutes ago   Up 30 minutes (healthy)   0.0.0.0:9736->6379/tcp, [::]:9736->6379/tcp                       wp1bot-redis-dev
d7075c2628f6   minio/minio                              "/usr/bin/docker-ent…"   4 days ago       Up 4 days (healthy)       0.0.0.0:9000-9001->9000-9001/tcp, [::]:9000-9001->9000-9001/tcp   wp1bot-minio-dev
1d466e07260f   mariadb:10.4                             "docker-entrypoint.s…"   19 months ago    Up 2 weeks                0.0.0.0:6600->3306/tcp, [::]:6600->3306/tcp                       wp1bot-test-db
7f06c4c77a50   5b0542ad1e77                             "docker-entrypoint.s…"   19 months ago    Up 2 weeks                0.0.0.0:9777->6379/tcp, [::]:9777->6379/tcp                       wp1bot-test-redis

audiodude avatar Nov 03 '25 20:11 audiodude

There doesn't appear to be a worker container running in the list. From the compose file, the name should be zimfarm-worker-manager

elfkuzco avatar Nov 03 '25 20:11 elfkuzco

Can you run docker logs -f zimfarm-worker-manager? My guess is that it died for some reason. Also, did you start the services with the zimfarm-worker Docker profile, i.e. docker compose -f docker-compose-dev.yml --profile zimfarm --profile zimfarm-worker up --pull always --build?

elfkuzco avatar Nov 03 '25 20:11 elfkuzco

This is the command I used: docker compose -f docker-compose-dev.yml --profile zimfarm --profile zimfarm-worker up --pull always --build -d

Here are the logs:

tmoney@tmoney-linux:~/code/wp1/wp1-frontend$ docker logs zimfarm-worker-manager 
[2025-11-03 19:46:36,061: INFO] starting zimfarm worker-manager.
[2025-11-03 19:46:36,061: INFO] configuration:
	username=test_worker
	webapi_uris=['http://zimfarm-api:80/v2']
	workdir=/data
	worker_name=test_worker
	OFFLINERS=['mwoffliner', 'youtube', 'phet', 'gutenberg', 'sotoki', 'nautilus', 'ted', 'openedx', 'zimit', 'kolibri', 'wikihow', 'ifixit', 'freecodecamp', 'devdocs', 'mindtouch']
	PLATFORMS_TASKS={}
	poll_interval=10
	sleep_interval=5
	selfish=False
[2025-11-03 19:46:36,061: INFO] testing workdir at /data…
[2025-11-03 19:46:36,061: INFO] 	workdir is available and writable
[2025-11-03 19:46:36,061: INFO] testing private key at /etc/ssh/keys/zimfarm…
[2025-11-03 19:46:36,061: CRITICAL] 	private key is not a readable path

audiodude avatar Nov 03 '25 20:11 audiodude

Okay I think I know the problem. In the first step in the README, when I initially create the Docker graph, this path doesn't exist: ./docker/zimfarm/id_ed25519.

I've encountered this before: in that case, Docker creates the path as a directory. Then, when we run the create_worker script, it can't overwrite the directory with the private key.

audiodude avatar Nov 03 '25 20:11 audiodude

Yes. Oddly enough, it happened to me too. I'll update the docs to prevent this from happening to anyone else.

elfkuzco avatar Nov 03 '25 20:11 elfkuzco

Just want to make sure. Is this line in the docker-compose file supposed to map a file to a file, or a directory to a directory?

volumes:
  - ./docker/zimfarm/id_ed25519:/etc/ssh/keys/zimfarm

If it's meant to map a file, we should simply do a touch docker/zimfarm/id_ed25519 before we start the first docker graph, so that it is initially mapped as an (empty) file that can then be overwritten. Also, I didn't even notice the line in the script that said "now copy the key blah blah". Can we just mv the key ourselves to that location within the script?
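The pre-creation step discussed above is a one-liner; the mkdir is an addition here to make the snippet self-contained in case the directory doesn't exist yet.

```shell
# Ensure the bind-mount source exists as an empty *file* before the first
# `docker compose up`, so Docker doesn't auto-create it as a directory.
mkdir -p docker/zimfarm
touch docker/zimfarm/id_ed25519
```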

audiodude avatar Nov 03 '25 21:11 audiodude

It's supposed to map to a file. I will revise the shell script to mv the key to that path.

elfkuzco avatar Nov 03 '25 21:11 elfkuzco

Okay my tasks are being picked up by the worker now! But they are failing. I see this in "Scraper stderr":

[error] [2025-11-03T21:16:58.480Z] Failed to run mwoffliner after [0s]:
 Error: Unknown S3 region set
    at S3.setRegion (/tmp/mwoffliner/src/S3.ts:37:13)
    at new S3 (/tmp/mwoffliner/src/S3.ts:26:10)
    at Module.execute (/tmp/mwoffliner/src/mwoffliner.lib.ts:149:13)
    at <anonymous> (/tmp/mwoffliner/src/cli.ts:66:8)

I assume it's because the optimization cache URL I'm sending in is https://localhost:9000/?keyId=minio_key&secretAccessKey=minio_secret&bucketName=org-kiwix-dev-cache and it's trying to parse a region from the hostname?

EDIT: If so, I understand that this is an issue for the mwoffliner repo, of course.

audiodude avatar Nov 03 '25 21:11 audiodude

Can you use one similar to the minio one configured for the uploader?

elfkuzco avatar Nov 03 '25 22:11 elfkuzco

EDIT: If so, I understand that this is an issue for the mwoffliner repo, of course.

The thing is that the container can't access localhost. You can use https://minio..... because the container can resolve the hostname minio, since they all share the same network.

elfkuzco avatar Nov 03 '25 22:11 elfkuzco

Or if you want, you can omit the optimization URL from your task.

elfkuzco avatar Nov 03 '25 22:11 elfkuzco

Okay I definitely think we can skip the S3 cache for dev scraping. After I got rid of that, I got a new error from mwoffliner, which was:

 Failed to read articleList from [http://localhost:5000/v1/builders/0b76807e-c1e3-44c0-a815-b0e8405a51e8/selection/latest.tsv]

This makes sense, since the worker is running inside the docker compose network, while my WP1 web/api/backend is running on the host machine. In fact, this is the exact reason we need a Zimfarm in dev anyway: we've changed the ZIM creation logic to use a dynamic URL from WP1 itself rather than a static file list on S3.

I think at this point, I'm going to start working on putting the dev backend server into the docker compose graph as well, with all the updates to configuration and README that are required for that. I'd like to use this same PR and then just merge the whole thing once we have a working, consistent dev environment.

@benoit74 @elfkuzco WDYT?

audiodude avatar Nov 04 '25 01:11 audiodude

I agree with you.

elfkuzco avatar Nov 04 '25 09:11 elfkuzco

Yes, for dev we should skip the S3 cache; we will not gain much besides pain. This is more of an internal detail of mwoffliner operation, not really needed.

I like the idea of adding the backend to the docker graph in the same PR. This is a great opportunity to nail down these dev setup issues and have a reproducible setup devs can use end-to-end. No more excuses for not testing stuff e2e once in a while. It's also a great asset in terms of documentation / learning base.

I would even suggest also adding web and api to the docker graph. With proper mount points and configuration, it should be possible to have hot reload whenever a dev changes something in the codebase; at least this is what we achieved in the zimfarm, zimit-frontend and cms repos, and it is (mostly?) transparent in terms of performance. It frees developers from having to install anything on their dev machine besides Docker, and ensures there are no headaches due to bad versions and the like. Quite important for everyone who is not a core maintainer and/or a bit lazy about setting things up correctly on their machine (which includes myself ^^)
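The hot-reload setup described here is typically achieved by bind-mounting the source tree into the container and running the server in reload mode. A hypothetical sketch follows; the service name, module path, mount paths and command details are all assumptions, loosely modeled on the wp1bot-web-dev container seen later in this thread.

```yaml
# Hypothetical sketch of a hot-reloading dev service, not the actual setup.
services:
  web-dev:
    build: .
    command: flask --app wp1.web.app run --host 0.0.0.0 --port 5000 --debug
    volumes:
      - ./wp1:/usr/src/app/wp1   # host edits are picked up by Flask's reloader
    ports:
      - "5000:5000"
```

With the source mounted, editing a file on the host restarts the Flask dev server inside the container, so nothing besides Docker is needed on the dev machine.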

benoit74 avatar Nov 04 '25 21:11 benoit74

Okay I've got the following in my docker:

^Ctmoney@tmoney-linux:~/code/wp1$ docker ps
CONTAINER ID   IMAGE                                           COMMAND                  CREATED         STATUS                   PORTS                                                             NAMES
e5a3e2025bc8   wp1-dev-dev-web                                 "flask --app wp1.web…"   5 days ago      Up 6 minutes             0.0.0.0:5000->5000/tcp, [::]:5000->5000/tcp                       wp1bot-web-dev
a133aa726bb8   ghcr.io/openzim/zimfarm-worker-manager:latest   "worker-manager --we…"   6 days ago      Up 6 days                                                                                  zimfarm-worker-manager
978375c302b7   ghcr.io/openzim/zimfarm-ui:latest               "/docker-entrypoint.…"   6 days ago      Up 6 days                127.0.0.1:8003->80/tcp                                            zimfarm-ui
055a203f63ff   wp1-dev-dev-workers                             "/bin/sh -c 'supervi…"   6 days ago      Up 5 minutes                                                                               wp1bot-workers-dev
5294b71f64f6   ghcr.io/openzim/zimfarm-backend:latest          "uvicorn zimfarm_bac…"   6 days ago      Up 6 days (healthy)      127.0.0.1:8004->80/tcp                                            zimfarm-api
590f8488d6f7   minio/minio                                     "/usr/bin/docker-ent…"   6 days ago      Up 6 minutes (healthy)   0.0.0.0:9000-9001->9000-9001/tcp, [::]:9000-9001->9000-9001/tcp   wp1bot-minio-dev
343148f4b8bc   postgres:17.3-bookworm                          "docker-entrypoint.s…"   6 days ago      Up 6 days (healthy)      127.0.0.1:2345->5432/tcp                                          zimfarm-postgresdb
92261129c194   redis                                           "docker-entrypoint.s…"   6 days ago      Up 6 minutes (healthy)   0.0.0.0:9736->6379/tcp, [::]:9736->6379/tcp                       wp1bot-redis-dev
1f0cd8e54a2f   wp1-dev-dev-database                            "docker-entrypoint.s…"   6 days ago      Up 6 minutes             0.0.0.0:6300->3306/tcp, [::]:6300->3306/tcp                       wp1bot-db-dev
1d466e07260f   mariadb:10.4                                    "docker-entrypoint.s…"   20 months ago   Up 3 weeks               0.0.0.0:6600->3306/tcp, [::]:6600->3306/tcp                       wp1bot-test-db
7f06c4c77a50   5b0542ad1e77                                    "docker-entrypoint.s…"   20 months ago   Up 3 weeks               0.0.0.0:9777->6379/tcp, [::]:9777->6379/tcp                       wp1bot-test-redis

I've changed the URL for the article list we send to Zimfarm to try and use the WP1 API that's running in docker, so I'm using http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv. But I get the following error:

[error] [2025-11-10T00:17:35.346Z] Failed to read articleList from [http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv] Error: Failed to read articleList from URL: http://web-dev:5000/v1/builders/94330657-fe26-4aea-8f14-f959ede293a0/selection/latest.tsv

I understand that this is a network connectivity issue, and I need to use the right domain for the WP1 API. However, the part I don't understand is the network topology for the worker/worker-manager/mwoffliner/etc. and where mwoffliner is actually running on the network. What should I put instead of http://web-dev:5000? Thanks!

audiodude avatar Nov 10 '25 00:11 audiodude

Also tried with wp1bot-web-dev:

[error] [2025-11-10T02:41:00.706Z] Failed to read articleList from [http://wp1bot-web-dev:5000/v1/builders/6a1f2ee7-5947-4222-8e12-b043cf376af4/selection/latest.tsv] Error: Failed to read articleList from URL: http://wp1bot-web-dev:5000/v1/builders/6a1f2ee7-5947-4222-8e12-b043cf376af4/selection/latest.tsv

It's reachable from zimfarm-api:

tmoney@tmoney-linux:~/code/wp1$ docker exec -it zimfarm-api bash
root@5294b71f64f6:/# curl http://wp1bot-web-dev:5000
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>WP 1.0 API</title>
   ....<SNIP>

audiodude avatar Nov 10 '25 02:11 audiodude