exp list: does not show new experiments

Open mstrupp opened this issue 2 years ago • 6 comments

Bug Report

Description

When the terminal is killed while dvc exp run is executing, the ref .git/refs/exps/exec/EXEC_BASELINE is not removed. When a git commit is made afterwards, git might pack the references to optimize performance. From then on, dvc exp list is stuck with the list of experiments from before the commit and does not update when new experiments are run.

This also affects the experiments table in the vscode extension.
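
To check whether a repository is in this state, it can help to see where the leftover ref is stored. The following is a minimal sketch, not part of DVC: it assumes git's default file-based ref storage, and the helper name locate_ref is made up for illustration. It reports whether the ref exists as a loose file or as an entry in .git/packed-refs. Once the ref is packed, deleting the loose file by hand would not remove it.

```python
from pathlib import Path
from typing import Optional

# Ref name taken from the report above.
EXEC_BASELINE = "refs/exps/exec/EXEC_BASELINE"


def locate_ref(git_dir: str, ref: str = EXEC_BASELINE) -> Optional[str]:
    """Return "loose", "packed", or None depending on where `ref` is stored.

    Hypothetical helper: assumes the default file-based ref backend.
    """
    git_path = Path(git_dir)

    # A loose ref is a plain file under the git directory.
    if (git_path / ref).is_file():
        return "loose"

    # A packed ref is a "<sha> <refname>" line in the packed-refs file.
    packed = git_path / "packed-refs"
    if packed.is_file():
        for line in packed.read_text(encoding="utf-8").splitlines():
            # Skip the header comment and peeled-tag lines ("^<sha>").
            if not line or line.startswith(("#", "^")):
                continue
            _sha, _, name = line.partition(" ")
            if name == ref:
                return "packed"
    return None
```

Calling locate_ref(".git") from the repository root returns None on a healthy repository with no experiment running.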

Reproduce

  1. git init
  2. dvc init
  3. dvc stage add -n prepare -d prepare.py python prepare.py
  4. create file prepare.py and write a program that takes some time (e.g. time.sleep(10))
  5. git add .
  6. git commit -m "commit 1"
  7. dvc exp run
  8. while running: Kill the terminal (not via ctrl+c but by closing the terminal)
  9. edit prepare.py (to make dvc exp run execute the pipeline again)
  10. git add .
  11. git commit -m "commit 2"
  12. git pack-refs --all (when committing, git sometimes runs "git pack-refs" automatically as an optimization, and this can happen at this point; running the command by hand simulates the automatic packing)
  13. dvc exp run
  14. dvc exp list

Expected

dvc exp list should show the experiment from step 13. Instead, it returns nothing. The experiment only shows up with dvc exp list -A.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.38.1 (exe)
---------------------------------
Platform: Python 3.10.9 on Windows-10-10.0.19045-SP0
Subprojects:

Supports:
        azure (adlfs = 2022.11.2, knack = 0.10.1, azure-identity = 1.12.0),
        gdrive (pydrive2 = 1.15.0),
        gs (gcsfs = 2022.11.0),
        hdfs (fsspec = 2022.11.0, pyarrow = 10.0.1),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.1.0),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2022.11.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: NTFS on C:\
Repo: dvc, git

mstrupp, Mar 28 '23 13:03

Hi @mstrupp, could you try upgrading to the latest DVC version?

daavoo, Mar 28 '23 13:03

Hi @daavoo, thank you for the response. I upgraded dvc but the problem still exists.

$ dvc doctor
DVC version: 2.51.0 (pip)
-------------------------
Platform: Python 3.10.8 on Windows-10-10.0.19045-SP0
Subprojects:
        dvc_data = 0.44.1
        dvc_objects = 0.21.1
        dvc_render = 0.3.1
        dvc_task = 0.2.0
        scmrepo = 0.1.17
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: NTFS on C:\
Repo: dvc, git
Repo.site_cache_dir: C:\ProgramData\iterative\dvc\Cache\repo\5db899e06b13bbca5a630f6ac0c2cbfd

mstrupp, Mar 28 '23 14:03

The workaround here would be to remove the exec ref with

git update-ref -d refs/exps/exec/EXEC_BASELINE
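
Since the leftover ref is easy to miss, the workaround can be wrapped in a small check. This is only a sketch under assumptions: the function name remove_stale_exec_ref is made up, it shells out to the standard git commands show-ref and update-ref, and it does not verify that no experiment is currently running, so it should only be used when you know that none is.

```python
import subprocess

# Ref name taken from the workaround above.
EXEC_BASELINE = "refs/exps/exec/EXEC_BASELINE"


def remove_stale_exec_ref(repo_dir: str = ".") -> bool:
    """Delete EXEC_BASELINE if it exists; return True if a ref was deleted.

    Hypothetical helper. Do not call it while `dvc exp run` is active.
    """
    # `git show-ref --verify --quiet` exits with 0 only if the ref exists,
    # regardless of whether it is stored loose or packed.
    exists = subprocess.run(
        ["git", "show-ref", "--verify", "--quiet", EXEC_BASELINE],
        cwd=repo_dir,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0
    if not exists:
        return False

    # Same command as the manual workaround; it also removes packed refs.
    subprocess.run(
        ["git", "update-ref", "-d", EXEC_BASELINE],
        cwd=repo_dir,
        check=True,
    )
    return True
```

Unlike deleting the file under .git/refs by hand, git update-ref -d also handles the case where the ref has already been moved into .git/packed-refs.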

The issue is that we have logic to account for when HEAD has moved during experiment execution, where exp show will then show experiments derived from EXEC_BASELINE instead of HEAD. We could consider updating the logic to check whether there is also an active workspace run (and clean up the ref when there is not), but this would also introduce additional overhead into every dvc command that uses resolve_rev.
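
The decision described here can be written down as a small function. This is an illustrative sketch only and does not reflect DVC's actual code: the names Baseline and pick_baseline are hypothetical, and how to determine workspace_run_active is deliberately left out, since that check is where the extra overhead would come from.

```python
from typing import NamedTuple, Optional


class Baseline(NamedTuple):
    rev: str           # revision that experiments are listed against
    cleanup_ref: bool  # whether the leftover EXEC_BASELINE ref should be deleted


def pick_baseline(
    head: str,
    exec_baseline: Optional[str],
    workspace_run_active: bool,
) -> Baseline:
    """Hypothetical sketch of the proposed logic, not DVC's implementation."""
    if exec_baseline is None:
        # No exec ref at all: HEAD is the baseline.
        return Baseline(head, False)
    if workspace_run_active:
        # HEAD may have moved during the run, so use the recorded baseline.
        return Baseline(exec_baseline, False)
    # The ref was left behind by a killed run: ignore it and flag it for cleanup.
    return Baseline(head, True)
```

In the scenario from this report, the third branch applies: the ref exists, no run is active, so the command would fall back to HEAD and remove the ref.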

pmrowla, Apr 04 '23 02:04

@pmrowla Is it needed for anything besides exp list and exp show? Can we do it only in those commands?

dberenbaum, Apr 04 '23 14:04

@dberenbaum it's needed for every DVC command that has any kind of parameter that can be set to (or defaults to) HEAD (so any diff/show command)

I should also note that if we drop checkpoints support, we could consider dropping this behavior as well. HEAD is still moved for regular experiments, but we restore it shortly afterwards when the experiment run ends. The main issue here is that for checkpoints, HEAD is moved to the most recently generated checkpoint commit. (We may not actually be able to drop this entirely, though, since tools like vscode could still try to run DVC commands before HEAD is restored at the end of a regular exp run.)

pmrowla, Apr 05 '23 06:04

Thanks for the suggested workaround @pmrowla.

Unfortunately, the user doesn't realize when the problem occurs or that the workaround should be applied. DVC happily shows the experiments from before EXEC_BASELINE. The user expects to see the new experiments but never learns why they are not shown.

mstrupp, Apr 14 '23 08:04