firefox-translations-training

Tracking does not support overriding a run: wandb [409] run was previously created and deleted

Open vrigal opened this issue 4 months ago • 3 comments

Publishing from a Taskcluster group with the --overide-runs argument successfully deletes the group's existing runs, but then fails to create the new runs:

wandb: ERROR Error while calling W&B API: run teacher-1_dziji was previously created and deleted; try a new run name (<Response [409]>)

Note: It is the ID that conflicts here, not the name as the message above suggests.

Furthermore, the client hangs for 90 seconds before failing:

wandb.errors.CommError: Run initialization has timed out after 90.0 sec.

This is annoying because we cannot both identify runs by a unique ID (<name>_<group_id>) and allow overriding a run in an existing project. Unfortunately, deleting all artifacts from the project does not seem to fix it. A possible quick fix would be to detect this exception and retry with a suffix (the name and ID would then be teacher-1_dziji_1, teacher-1_dziji_2…). That should work, although the display is not ideal and may be confusing, so we should at least consider documenting it.
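
A minimal sketch of that retry-with-suffix idea, not the project's actual code. `RunConflictError`, `candidate_ids`, `init_with_suffix` and `fake_init` are hypothetical names introduced for illustration. In real code the `init_run` callable would wrap the W&B run creation and turn the 409 conflict into `RunConflictError`; here a fake stands in for it so the example runs without a W&B account.

```python
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


class RunConflictError(Exception):
    """Hypothetical stand-in for the error raised on a 409 run ID conflict."""


def candidate_ids(base_id: str, max_retries: int) -> Iterator[str]:
    """Yield base_id, then base_id_1, base_id_2, ... up to max_retries suffixes."""
    yield base_id
    for n in range(1, max_retries + 1):
        yield f"{base_id}_{n}"


def init_with_suffix(
    init_run: Callable[[str], T], base_id: str, max_retries: int = 3
) -> T:
    """Call init_run with the first candidate ID that does not conflict."""
    last_error: Optional[Exception] = None
    for run_id in candidate_ids(base_id, max_retries):
        try:
            return init_run(run_id)
        except RunConflictError as error:
            # This ID was previously created and deleted: try the next suffix.
            last_error = error
    raise RuntimeError(
        f"No free run ID found for {base_id!r} after {max_retries} retries"
    ) from last_error


# Simulation only: IDs that the tracking service would reject as deleted.
DELETED_IDS = {"teacher-1_dziji", "teacher-1_dziji_1"}


def fake_init(run_id: str) -> str:
    """Pretend to create a run; reject IDs that were created and deleted."""
    if run_id in DELETED_IDS:
        raise RunConflictError(f"run {run_id} was previously created and deleted")
    return run_id
```

With the simulated state above, `init_with_suffix(fake_init, "teacher-1_dziji")` skips the two rejected IDs and settles on `teacher-1_dziji_2`. Since each failed attempt may block until the 90-second timeout reported above, keeping `max_retries` small seems sensible.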

I think W&B disallows overriding a run because it keeps the data so that deleted runs can be restored for 7 days (see this issue: https://github.com/wandb/wandb/issues/6395). In the worst case, we could clean everything now (with --overide-runs), then hope that re-uploading in a week works. It would be worth contacting the W&B team about this.

I suppose we have not detected it since we started using similar names and IDs to identify runs in the bar charts.

vrigal, Oct 11 '24 09:10