dvc exp show: --queued and --failed

dvc exp show should only include successfully completed experiments by default but should have options to include queued/failed items from the queue. (edit: exp show --queued/--failed)

Originally posted by @dberenbaum in https://github.com/iterative/dvc/issues/7592#issuecomment-1154119198

Jul 07 '22 14:07 dberenbaum

Need to coordinate with VS Code on these changes cc @mattseddon

Jul 07 '22 14:07 dberenbaum

Related to https://github.com/iterative/vscode-dvc/issues/1995 & https://github.com/iterative/vscode-dvc/issues/1996.

Jul 13 '22 20:07 mattseddon

Given the discussion of dvc exp being the primary interface over dvc queue, it might make more sense to include all of these by default and make the flags --hide-queued/--hide-failed.

Jul 14 '22 18:07 dberenbaum

@mattseddon I'm now working on this, but I found that there is still some data format required to be decided.

In the current json result of exp show there is no flag for failed experiments .

    "a4dedc07945e645a3b5ca0ede77330908688ea47": {
      "data": {
        "timestamp": "2022-08-26T15:55:17",
        "params": {
          "params.yaml": {
            "data": {
              "interval": 1
            }
          }
        },
        "deps": {},
        "outs": {
          "output.txt": {
            "hash": "c4ca4238a0b923820dcc509a6f75849b",
            "size": 1,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": false
          },
          "out.txt": {
            "hash": null,
            "size": null,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": false
          }
        },
        "queued": true,
        "running": false,
        "executor": null
      }
    }

We need a new flag for it. what do you prefer "success": false or "failed": true. Some other things to be noticed:

the default value to compatible with old versions.
for those failed experiments, you should also disregard the info in outs.

Sep 07 '22 12:09 karajan1001

Do we need separate fields for each state, or are they all mutually exclusive and we should have a single state field?

Sep 07 '22 19:09 dberenbaum

Do we need separate fields for each state, or are they all mutually exclusive and we should have a single state field?

All are OK for me but will give more trouble to those who are parsing the JSON themselves, inside the DVC the JSON parsing is always going with the JSON format. This is why I'm asking for advice from matt.

Sep 08 '22 02:09 karajan1001

Is completed/failed a boolean result? Can there be a partial failure? If it is a boolean then I'm happy with succeeded of failed. Will the error message will be nested somewhere within the dict? We will want to display as much information to the user as possible 👍🏻.

Sep 08 '22 05:09 mattseddon

Is completed/failed a boolean result? Can there be a partial failure? If it is a boolean then I'm happy with succeeded of failed. Will the error message will be nested somewhere within the dict? We will want to display as much information to the user as possible 👍🏻.

For the error message, it could be gotten through the dvc queue logs, this command reads the stored output of the executor. It's not hard for me to read the message from this file and paste them into the JSON, the only question is What should I put into the JSON, is the last line of the output enough? And for current discussion the final JSON format would be something like

"sha_rev": {
  "data": {
    "timestamp": %YYYY-%mm-%ddT%HH-%MM-%SS,
    "params": {
...
      },
    "error_msg": "", (if added)
    },
    "deps": {
...
},
    "outs": {
...
    },
    "queued": bool,
    "running": bool,
    "success":bool,
    "executor": ...
  }
}

and what do you think about using a single "status" field instead of three boolean ones? Because you might get a confused result with more than one true (of course it's a bug but it is still possible).

Sep 08 '22 06:09 karajan1001

The format that we expect for a record that is completely broken is:

"sha_rev": {
  "error": {
    "msg": error_msg,
    "type": ""
      },
    },
    ...
    "success":false,
  }
}

Where the error key replaces data.

However, if we know what the params/deps are we could simply put the error message into the missing metrics/outs files.

{
  "data": {
    "timestamp": "2022-09-15T11:07:57",
    "params": {
      "params.yaml": {
        "data": {
          "lr": 0.006,
          "weight_decay": 0,
          "epochs": 15
        }
      }
    },
    "deps": {
      "data/MNIST": {
        ...
      },
      ...
    },
    ...
    "metrics": {
      "metric_file": {
        "error": {
          "msg": error_msg
          "type": error_type
        }
      }
    },
  }
}

The second option is IMO more beneficial to users because they'd want to know which of the queued items have failed so that they can retry them. If we omit the param info and only show failed records they would have to reverse engineer/work out which ones have failed.

and what do you think about using a single "status" field instead of three boolean ones? Because you might get a confused result with more than one true (of course it's a bug but it is still possible).

I would be fine to merge the running/queued/success/pending fields into a single status field. This is a breaking change though so we would need to coordinate on the release. Anyone using an old version of the extension with a new version of the CLI is going to have a weird experience.

Sep 15 '22 02:09 mattseddon

The second option is IMO more beneficial to users because they'd want to know which of the queued items have failed so that they can retry them. If we omit the param info and only show failed records they would have to reverse engineer/work out which ones have failed.

It is more an execution failure instead of metric_file file format wrong. The error message is probably some callstack message. So I think

"sha_rev": {
  "error": {
    "msg": error_msg,
    "type": ""
      },
    },
    ...
    "success":false,
  }
}

Makes more sense.

I would be fine to merge the running/queued/success/pending fields into a single status field. This is a breaking change though so we would need to coordinate on the release. Anyone using an old version of the extension with a new version of the CLI is going to have a weird experience.

Another choice is to keep both

    "queued": bool,
    "running": bool,

and status for a time, the front end can first look if the status field exists and if not back to the old version of the parsing method. And remove the old API 1 or 2 versions later.

Sep 15 '22 09:09 karajan1001