Ben Hearsum (he/him)

Results 200 comments of Ben Hearsum (he/him)

That seems sensible to me, yeah. The only reason I put the total chunks in the name was because it's what we do in chunked jobs on other projects. If...

> I think the cache `teacher-ensemble` needs to be removed as well:
>
> ```
> cache:
>     from-parameters:
>         teacher-ensemble: training_config.experiment.teacher-ensemble
> ```

Yeah, good call. Come to think...

> Looks good to me. FWIW - Ben is refactoring this `mounts` stuff on #546

We'll see! I'm not sure if I will be refactoring that in advance, or following...

I don't think there's anything we can do to make this better in this repo or in taskgraph. This is a worker issue that's been filed as https://github.com/taskcluster/taskcluster/issues/6894

There's no technical reason we can't do that. The secrets are defined in the task definitions: https://github.com/mozilla/firefox-translations-training/blob/c4b0d121985d84ed88aefc966df0ffbb1431a3ed/taskcluster/kinds/train-backwards/kind.yml#L70 The person who triggered the task (whether by opening a PR, making a...

I'm asking around other projects to see if they're seeing this as well.

Haven't seen reports of this elsewhere. @eu9ene - have you seen this on GPU workers only? Or also on the CPU workers?

Thanks; so it seems very unlikely to be related to specific worker images. @aerickson - I don't suppose you have any idea what's going on here?

We do see these fairly often - I would say maybe on 5-10% of the tasks run. I'll try to collect some data to help us analyze this better.

Here's failures by worker group:

```
defaultdict(<class 'int'>, {'us-central1': 7, 'us-central1-a': 9, 'us-central1-b': 7, 'us-central1-c': 7, 'us-central1-f': 9, 'us-west1': 5, 'us-west1-a': 5, 'us-west1-b': 11})
```

And here's timestamps when we hit...
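
For reference, a per-worker-group tally like the one above could be produced roughly as follows. This is a minimal sketch, not the script that was actually used: the shape of the `failures` records, the `workerGroup` field name, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def count_failures_by_worker_group(failures):
    """Tally failed task runs per worker group.

    `failures` is assumed to be an iterable of dicts that each carry a
    "workerGroup" key (a hypothetical record shape for this sketch).
    """
    counts = defaultdict(int)  # missing groups start at 0
    for run in failures:
        counts[run["workerGroup"]] += 1
    return counts


# Hypothetical sample records; real data would come from the task run history.
sample_failures = [
    {"taskId": "task-a", "workerGroup": "us-central1-a"},
    {"taskId": "task-b", "workerGroup": "us-west1-b"},
    {"taskId": "task-c", "workerGroup": "us-central1-a"},
]

counts = count_failures_by_worker_group(sample_failures)
print(counts)
```

Printing a `defaultdict(int)` directly gives the same `defaultdict(<class 'int'>, {...})` form shown in the output above.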