TaskCluster not via-CI

Open AmitMY opened this issue 1 year ago • 5 comments

Since you are not maintaining Snakemake, I'd like to use TaskCluster. I read these instructions - https://github.com/mozilla/firefox-translations-training/blob/main/docs/task-cluster.md which seem to claim that training runs happen from git CI.

I would like to run Taskcluster locally and configure it to use my GCP instance.

Seems like I need to start with

git clone https://github.com/taskcluster/taskcluster
cd taskcluster
docker compose up
echo '127.0.0.1 taskcluster' | sudo tee -a /etc/hosts

Now opening http://taskcluster opens the Taskcluster UI.

From here, how can I push the task graph in this repository to Taskcluster? I feel like the tutorial should cover that. Also, will the tasks spawn GCP workers as needed, or should those be created ahead of time?

AmitMY, Jan 26 '24 21:01

I'm not a taskcluster expert, and maybe others can chime in here.

This has information on the taskgraph that is generated: https://taskcluster-taskgraph.readthedocs.io/en/latest/

If you run utils/preflight_check.py, it will generate a local taskgraph that you can inspect. The output is written to the artifacts directory in the repo. I know there is an artifacts/run-task in there, and artifacts/full-task-graph.json contains all of the tasks that need to run.
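For a quick look at what ended up in the generated graph, something like the following might help. It is only a minimal sketch: it assumes that artifacts/full-task-graph.json is a JSON object keyed by task label and that each entry carries a "kind" field, which is my understanding of Taskgraph's output rather than something confirmed in this thread. The function names are made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def load_task_graph(path: str = "artifacts/full-task-graph.json") -> dict:
    """Read the generated task graph from disk (path as mentioned above)."""
    return json.loads(Path(path).read_text())


def summarize_task_graph(graph: dict) -> Counter:
    """Count tasks per kind.

    Assumption: every value in the mapping is a dict with a "kind" key.
    Entries without one are counted under "unknown".
    """
    return Counter(entry.get("kind", "unknown") for entry in graph.values())


# Example usage after running the preflight check:
#   graph = load_task_graph()
#   for kind, count in summarize_task_graph(graph).most_common():
#       print(count, kind)
```

This only reads the file locally; it does not talk to any Taskcluster service.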

How Taskcluster works beyond that is outside my understanding of the system.

There is the https://chat.mozilla.org/#/room/#taskcluster:mozilla.org channel, where people may be able to answer questions.

gregtatum, Jan 30 '24 19:01

I generated the task graph using:

make preflight-check

The run-task script seems to need to run on the servers, not on my client. I still can't figure out how to do that outside of CI, though.

My goal is:

  1. Get a small VM running Taskcluster
  2. "Push" a task graph to it
  3. From Taskcluster, start a new training job, which will spawn GCP instances to run tasks
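Step 2 is the open question here, but one part of it can be explored offline: working out an order in which the tasks could be handed over so that each one comes after the tasks it depends on. The sketch below assumes that each entry in full-task-graph.json has a "dependencies" object mapping an edge name to another task's label. That is my reading of the file, not a documented guarantee, and `submission_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def submission_order(graph: dict) -> list:
    """Return task labels ordered so dependencies come before dependents.

    Assumption: entry["dependencies"] maps an edge name to the label of
    the task that must finish first. Raises graphlib.CycleError if the
    graph contains a cycle.
    """
    sorter = TopologicalSorter()
    for label, entry in graph.items():
        deps = entry.get("dependencies", {})
        sorter.add(label, *deps.values())
    return list(sorter.static_order())
```

This does not submit anything; it only shows the ordering constraint that any "push" mechanism would have to respect.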

AmitMY, Jan 31 '24 11:01

Apologies for the slow reply - I didn't see this issue until now.

It is technically possible to run your own Taskcluster instance and run training on it, although I'm not sure I would advise it. Roughly, the steps would be:

  • Bring up your own set of Taskcluster services (this is as simple as docker-compose up with https://github.com/taskcluster/taskcluster)
  • Hook up your new Taskcluster instance to a GitHub repo (you need this in order for Decision tasks to run, which is a prerequisite for running training)
  • Configure the right workers and worker types (your worker types must match what's in https://github.com/mozilla/firefox-translations-training/blob/e6ec0d5474ce5a98d5e8f0907e0862a356294468/taskcluster/config.yml#L64, unless you modify those entries). This is likely to be the trickiest part, and involves a number of things:
    • Creating GCP images that instances can be spawned off of
    • Runtime configuration of worker-runner and other parts of the Taskcluster services
    • Other things that I'm forgetting right now...
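To see which worker types a self-hosted instance would need to provide, the generated task graph can be compared against whatever pools you have set up. This is a rough sketch under assumptions: it expects each entry's "task" definition to name its queue either as "taskQueueId" or as a "provisionerId" plus "workerType" pair (which of these appears may depend on the Taskgraph version), and the pool names in the usage example are invented.

```python
def required_worker_pools(graph: dict) -> set:
    """Collect the "provisioner/workerType" identifiers that tasks refer to.

    Assumption: the task definition lives under entry["task"] and uses
    either "taskQueueId" or "provisionerId" together with "workerType".
    """
    pools = set()
    for entry in graph.values():
        task = entry.get("task", {})
        queue_id = task.get("taskQueueId")
        if queue_id is None and "provisionerId" in task and "workerType" in task:
            queue_id = f"{task['provisionerId']}/{task['workerType']}"
        if queue_id:
            pools.add(queue_id)
    return pools


def missing_worker_pools(graph: dict, configured: set) -> set:
    """Return the pools the graph needs that are not in `configured`."""
    return required_worker_pools(graph) - configured


# Example with invented names:
#   missing_worker_pools(graph, {"example-provisioner/cpu-worker"})
```

Anything this reports as missing would either need a matching worker pool or a change to the worker entries in taskcluster/config.yml.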

The Taskcluster channel that @gregtatum linked to is usually pretty keen to help others get the core Taskcluster services working, but I'm not sure how much guidance they'll be able to offer on Translations-specific things, nor can I commit to helping with this.

bhearsum, Feb 15 '24 00:02

Another option that we have discussed for the future is to build a feature in Taskgraph that generates a Snakemake definition in addition to the Taskcluster one. We are not sure if or when we'll be able to build it, though.

marco-c, Feb 15 '24 08:02

Thanks @bhearsum - I guess since I don't really have permissions on Mozilla's cluster, my only course of action is to set up a new instance.

@marco-c that would be swell! I think that would allow for much easier experimentation for researchers. Until now, I was running it in a Docker container on a single 4-GPU machine, and it worked fine, except that the translation performance was poor. Now that many bugs should be fixed, I wanted to try again, but the Snakemake definitions are out of date.

AmitMY, Feb 15 '24 14:02