Ben Hearsum (he/him)
That seems sensible to me, yeah. The only reason I put the total chunks in the name was because it's what we do in chunked jobs on other projects. If...
> I think the cache `teacher-ensemble` needs to be removed as well:
>
> ```
> cache:
>     from-parameters:
>         teacher-ensemble: training_config.experiment.teacher-ensemble
> ```

Yeah, good call. Come to think...
> Looks good to me. FWIW - Ben is refactoring this `mounts` stuff on #546

We'll see! I'm not sure if I will be refactoring that in advance, or following...
I don't think there's anything we can do to make this better in this repo or in taskgraph. This is a worker issue that's been filed as https://github.com/taskcluster/taskcluster/issues/6894
There's no technical reason we can't do that. The secrets are defined in the task definitions: https://github.com/mozilla/firefox-translations-training/blob/c4b0d121985d84ed88aefc966df0ffbb1431a3ed/taskcluster/kinds/train-backwards/kind.yml#L70 The person who triggered the task (whether by opening a PR, making a...
I'm asking around other projects to see if they're seeing this as well.
Haven't seen reports of this elsewhere. @eu9ene - have you seen this on GPU workers only? Or also on the CPU workers?
Thanks; so it seems very unlikely to be related to specific worker images. @aerickson - I don't suppose you have any idea what's going on here?
We do see these fairly often - I would say maybe on 5-10% of the tasks run. I'll try to collect some data to help us analyze this better.
Here's failures by worker group:

```
defaultdict(<class 'int'>, {'us-central1': 7, 'us-central1-a': 9, 'us-central1-b': 7, 'us-central1-c': 7, 'us-central1-f': 9, 'us-west1': 5, 'us-west1-a': 5, 'us-west1-b': 11})
```

And here's timestamps when we hit...
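For anyone who wants to reproduce a tally like the worker-group one above, here is a minimal sketch of the counting step. It is not the script that produced those numbers: the `failed_tasks` records, their `taskId` and `workerGroup` keys, and the `failures_by_worker_group` helper are all hypothetical, and the real input would have to come from wherever the failed task runs are collected.

```python
from collections import defaultdict

# Hypothetical sample records (assumption): one dict per failed task run,
# with the worker group it ran in. These are NOT the real failures.
failed_tasks = [
    {"taskId": "task-a", "workerGroup": "us-central1-a"},
    {"taskId": "task-b", "workerGroup": "us-west1-b"},
    {"taskId": "task-c", "workerGroup": "us-central1-a"},
    {"taskId": "task-d", "workerGroup": "us-west1"},
]


def failures_by_worker_group(tasks):
    """Count failed task runs per worker group (hypothetical helper)."""
    counts = defaultdict(int)  # missing groups start at 0
    for task in tasks:
        counts[task["workerGroup"]] += 1
    return counts


counts = failures_by_worker_group(failed_tasks)

# Print groups in a stable order so runs are easy to compare.
for group, n in sorted(counts.items()):
    print(f"{group}: {n}")
```

If the counts stay roughly even across groups, as they do above, that would point away from a problem with one particular zone.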