firefox-translations-training
TaskCluster not via-CI
Since you are no longer maintaining the Snakemake pipeline, I'd like to use Taskcluster. I read these instructions (https://github.com/mozilla/firefox-translations-training/blob/main/docs/task-cluster.md), which seem to say that training runs are started from CI on the git repository.
I would like to run Taskcluster locally and configure it for my GCP instance.
It seems I need to start with:
git clone https://github.com/taskcluster/taskcluster
cd taskcluster
docker compose up
echo '127.0.0.1 taskcluster' >> /etc/hosts
Now http://taskcluster opens the Taskcluster UI.
From here, how can I push the task group in this repository to Taskcluster? I feel like the tutorial should cover that. Also, will the tasks spawn GCP workers as needed, or should those be created ahead of time?
I'm not a taskcluster expert, and maybe others can chime in here.
This has information on the taskgraph that is generated: https://taskcluster-taskgraph.readthedocs.io/en/latest/
If you run utils/preflight_check.py, it will generate a local taskgraph that you can inspect. It is located in the /artifacts directory in the repo. I know there is an artifacts/run-task in there. The artifacts/full-task-graph.json file contains all of the tasks that need to run.
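To make the generated graph easier to inspect, here is a minimal sketch that counts the tasks per kind. It assumes that artifacts/full-task-graph.json is a JSON object mapping each task label to a definition with a "kind" field, which is how Taskgraph usually writes it; the function names are made up for illustration, and the field name may need adjusting for your checkout.

```python
import json
from collections import Counter
from pathlib import Path


def load_task_graph(path: str = "artifacts/full-task-graph.json") -> dict:
    """Load the task graph written by the preflight check (path is an assumption)."""
    return json.loads(Path(path).read_text())


def summarize_task_graph(graph: dict) -> Counter:
    """Count tasks per kind, assuming a label -> definition mapping with a "kind" key."""
    return Counter(entry.get("kind", "unknown") for entry in graph.values())


if __name__ == "__main__":
    graph = load_task_graph()
    print(f"{len(graph)} tasks in total")
    for kind, count in summarize_task_graph(graph).most_common():
        print(f"{kind}: {count}")
```

Run it from the repository root after generating the graph, so the relative artifacts path resolves.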
How Taskcluster works beyond that is outside my understanding of the system.
There is the https://chat.mozilla.org/#/room/#taskcluster:mozilla.org group that may answer questions.
I got the task graph using make preflight-check. The run-task seems to need to run on the servers, not on my client. I still can't figure out how to do it outside of CI, though.
My goal is:
- get a small VM running taskcluster
- "push" a task graph to it
- from taskcluster, start a new training job, which will spawn GCP instances to run tasks
Apologies for the slow reply - I didn't see this issue until now.
It is technically possible to run your own Taskcluster instance and run training on it, although I'm not sure I would advise it. Roughly, the steps would be:
- Bring up your own set of Taskcluster services (this is as simple as docker-compose up with https://github.com/taskcluster/taskcluster)
- Hook up your new Taskcluster instance to a GitHub repo (you need this in order for Decision tasks to run, which is a prerequisite for running training)
- Configure the right workers and worker types (your worker types must match what's in https://github.com/mozilla/firefox-translations-training/blob/e6ec0d5474ce5a98d5e8f0907e0862a356294468/taskcluster/config.yml#L64, unless you modify those entries). This is likely to be the trickiest part, and involves a number of things:
  - Creating GCP images that instances can be spawned off of
  - Runtime configuration of worker runner and other parts of the Taskcluster services
  - Other things that I'm forgetting right now...
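As a starting point for the worker configuration step, the sketch below lists the worker pools that the generated task graph refers to, so you know which ones a self-hosted instance would have to provide. It assumes each entry in artifacts/full-task-graph.json has a "task" object carrying either "taskQueueId" or the older "provisionerId" and "workerType" pair; the function name is hypothetical and the field names should be checked against your own graph.

```python
from __future__ import annotations

import json
from pathlib import Path


def required_worker_pools(graph: dict) -> set[str]:
    """Collect distinct worker pools as "provisionerId/workerType" strings.

    Assumes a label -> definition mapping whose "task" object names its queue.
    Entries without queue information are skipped.
    """
    pools: set[str] = set()
    for entry in graph.values():
        task = entry.get("task", {})
        if "taskQueueId" in task:
            pools.add(task["taskQueueId"])
        elif "provisionerId" in task and "workerType" in task:
            pools.add(f"{task['provisionerId']}/{task['workerType']}")
    return pools


if __name__ == "__main__":
    # Path is an assumption: the location the preflight check is said to write to.
    graph = json.loads(Path("artifacts/full-task-graph.json").read_text())
    for pool in sorted(required_worker_pools(graph)):
        print(pool)
```

Each printed pool would need a matching worker type on your instance, or a corresponding change to taskcluster/config.yml.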
The Taskcluster channel that @gregtatum linked to is usually pretty keen to help others get the core Taskcluster services working, but I'm not sure how much guidance they'll be able to offer on Translations-specific things, nor can I commit to helping with this.
Another option that we have discussed for the future is to build a feature in Taskgraph to generate a Snakemake definition in addition to a Taskcluster one. We are not sure if/when we'll be able to build it though.
Thanks @bhearsum - I guess since I don't really have permissions on Mozilla's cluster, my only course of action is to set up a new instance.
@marco-c that would be swell! I think that would allow for much easier experimentation for researchers. Until now, I was running it in a Docker container on a single 4-GPU machine, and it worked fine, except that the translation performance was poor. Now that many bugs should be fixed, I wanted to try again, but the Snakemake definitions are out of date.