
exp --queue and --run-all can hang and cause future exp show runs to hang


The original Issue has since been solved, but there's still a related problem which this Issue now tracks. See this comment for info on the still open issue.


Original issue:

I had mentioned this when working on plots, but it seems this is more of a general problem that occurs on master as well.

After running one experiment, live updating of experiments ~~stops working~~ delays for at least 5 and up to 15 checkpoints before dumping all of them at once, sometimes crashing VSCode in the process. You can see the output change as well: it shows nothing until the experiment finishes, then pushes all the checkpoints at once. It almost looks like it could be a dvc issue, but I don't know how it would be, considering it's related to the vscode session.

https://user-images.githubusercontent.com/9111807/134056404-8ef0d288-4184-446d-8d8c-c21f7678245b.mp4

This issue also happens when running dvc exp show on the CLI outside of VSCode, with slightly different symptoms in the table itself: updates don't happen until a bit after the command has finished.

https://user-images.githubusercontent.com/9111807/134057377-608bd480-3600-4cde-9ed9-7d88dbe8f12d.mp4

I've also checked combinations of running from both the CLI and VSCode, in both CLI -> VSCode and VSCode -> CLI order. The table symptoms reflect whichever command is run second.

rogermparent avatar Sep 20 '21 18:09 rogermparent

So, is this a DVC issue?

shcheklein avatar Sep 20 '21 21:09 shcheklein

To me it looks like an issue with restarting from existing checkpoints. What happens if you increase the number of epochs to 15 or some number like that? I think the 3 epochs happen so quickly that it seems like "nothing is sent until the experiment finishes". What happens if you change the params so that you are forcing another experiment to start?
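A minimal sketch of what is being suggested here, assuming the demo project exposes EPOCHS in train.py and an lr param (both referenced later in this thread):

# raise the epoch count in train.py (e.g. EPOCHS = 15), then change a param
# so DVC starts a fresh experiment instead of resuming from a checkpoint
dvc exp run -S lr=0.005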

Here is a demo from another repo:

https://user-images.githubusercontent.com/37993418/134087833-ed91d7fc-18e2-4dc3-b4e4-cb6f601fc0da.mov

We still have issues where commands run into each other, so things can be slow to update, but live updates are not broken.

mattseddon avatar Sep 20 '21 23:09 mattseddon

Let me try to replicate with higher epochs and non-continuation. I remember the same thing happening when I tried this while working on plots, but it's always good to confirm.

rogermparent avatar Sep 21 '21 00:09 rogermparent

With a few tries and a bit of luck, I finally got a demo of a 15 epoch run on two experiments that didn't crash.

https://user-images.githubusercontent.com/9111807/134100716-aa39ea7a-e61c-4f73-9282-ee67f4a9863f.mp4

On a few attempts, I saw what @mattseddon hypothesized: a delayed update before live updates resumed. But even when that happened, the delay was between 5 and 10 epochs, and occasionally VSCode crashed at the point where other successful attempts dumped those epochs and resumed. They all share the same commonality: the first run works totally fine, but subsequent runs have issues.

rogermparent avatar Sep 21 '21 02:09 rogermparent

@shcheklein It's really hard to pin down whether or not this actually is a DVC issue. At least for now, I'd lean towards no since symptoms can only be seen within VSCode.

rogermparent avatar Sep 21 '21 02:09 rogermparent

@rogermparent I cannot reproduce:

https://user-images.githubusercontent.com/37993418/134101351-be39868d-13b0-42bb-b543-5cf4f4d217b0.mov

https://user-images.githubusercontent.com/37993418/134101373-803e32ca-b78c-47c6-9e48-93460f5dd3ef.mov

When was the last time that you upgraded your .env (source .env/bin/activate && pip install -r requirements.txt -U)?

mattseddon avatar Sep 21 '21 02:09 mattseddon

@rogermparent wait a minute, HOW is exp show taking 121s to run?

image

When was the last time you garbage collected experiments... even then WAT! How long does that action take in the CLI? The problem is whatever is causing exp show to run for an eternity. Takes ~1.2s on my machine (and my machine is garbage).

mattseddon avatar Sep 21 '21 02:09 mattseddon

When was the last time that you upgraded your .env (source .env/bin/activate && pip install -r requirements.txt -U)?

I ran a git clean -fxd and reinstalled all dependencies before attempting that second recording, and I just verified with those commands that my .env is up to date. It's worth noting that the extension dev host + dvc reaches 13/16 GB of my RAM, but doesn't seem to cross the threshold.

wait a minute, HOW is exp show taking 121s to run?

When was the last time you garbage collected experiments... even then WAT! How long does that action take in the CLI?

Great catch! It looks like exp show is starting alongside exp run and idling until some point in the middle of exp run, without ever failing like it does in your examples. Maybe I spoke too soon, and this could be a Linux-specific DVC issue related to commands running into each other, specifically run and show?

I ran GC quite a few times in the process; running it after that example (and a couple of experiments later) only cleaned 4 experiments.

rogermparent avatar Sep 21 '21 02:09 rogermparent

relates to https://github.com/iterative/dvc/issues/5739

mattseddon avatar Sep 22 '21 22:09 mattseddon

I've been able to recreate this issue on Codespaces, leading me to believe it's an issue related to running on Linux.

https://user-images.githubusercontent.com/9111807/134733165-655bc0db-9cee-4e3d-ac66-9397053eb645.mp4

Bad news that it's happening, good news that it should be much easier for everyone to replicate this way!

rogermparent avatar Sep 24 '21 20:09 rogermparent

As discussed in the meeting, I tried to replicate the bug using only the CLI and not VSCode. I couldn't successfully do so. Here's what I did:

  • run dvc status --show-json, dvc diff --show-json, and dvc exp show --show-json in a loop with watch -n0 (re-running as soon as the last update ends) in three separate terminals to simulate what we assumed would be causing lock issues with dvc (sketched after this list)
  • dvc exp run in a fourth terminal
  • Observe the dvc exp show terminal for delays in changes
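A rough sketch of that setup (the exact watch invocations are an approximation of the steps above; the dvc commands are the ones listed):

# terminal 1
watch -n0 dvc status --show-json
# terminal 2
watch -n0 dvc diff --show-json
# terminal 3
watch -n0 dvc exp show --show-json
# terminal 4
dvc exp run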

Recording:

https://user-images.githubusercontent.com/9111807/138955175-928c7005-5d72-44d9-a3b0-a7dd0299a9d2.mp4

I couldn't see any delays like we're experiencing in VSCode, only a couple of seconds at most. Given that, maybe we can hypothesize that the issue has something to do with VSCode being a single-threaded Node application that's executing and reading the output of the dvc calls? We've run into issues before with Node's process execution behaving differently on different platforms, so that could explain the platform-specific nature of this Issue.

rogermparent avatar Oct 26 '21 20:10 rogermparent

I am now seeing a similar issue where exp show hung for a long time. Here are the steps that I went through:

  1. queue 4 experiments
  2. run all experiments
  3. after the first experiment completed, exp show began to hang
  4. after killing the session, exp show began running in a loop
  5. my repo is now in this weird state where exp show is now "touching" vscode-dvc/.git/refs/exps/93/3131de80b564742afb0795f683af6b955b327b/exp-a593b & vscode-dvc/.git/refs/exps/exec/EXEC_CHECKPOINT every time it runs (even after running dvc gc -T), which is causing an infinite loop of exp show commands (see the sketch after this list)
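A hedged sketch for confirming the "touching" from the repo root (the ref path is the one mentioned in step 5; comparing mtimes with stat is just one way to check):

stat .git/refs/exps/exec/EXEC_CHECKPOINT
dvc exp show --show-json > /dev/null
stat .git/refs/exps/exec/EXEC_CHECKPOINT   # the mtime changes if exp show touched the ref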

still investigating

mattseddon avatar Dec 17 '21 01:12 mattseddon

@efiop I now have a minimal reproducible example for this scenario. I have been able to recreate with two terminal sessions and without VS Code doing anything in the background.

Here are the steps:

  1. In our vscode-dvc/demo repo, set EPOCHS = 36 in train.py and queue 4 experiments:
dvc exp run --queue -S lr=0.001
dvc exp run --queue -S lr=0.005
dvc exp run --queue -S lr=0.0001
dvc exp run --queue -S lr=0.0005
  2. dvc exp run --run-all
  3. Wait for the first experiment to finish.
  4. At the start of the second run, dvc exp show --show-json will hang and keep the .dvc/tmp/lock (a diagnostic sketch follows this list).
  5. As far as I can see, dvc exp show --show-json will never return.
  6. dvc exp run --run-all gets to the final step, but after the final experiment finished I left it running for 30+ minutes before I gave up and killed it; it was burning a lot of CPU and running 4 threads.
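As mentioned in step 4, a small diagnostic sketch (assumed commands, not necessarily the exact ones used) for checking things from another terminal while exp show hangs:

ls -la .dvc/tmp/lock      # the lock file that the hanging exp show keeps
ps aux | grep "dvc exp"   # see which dvc processes are still alive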

After reproducing for the second time I killed the --run-all process earlier and calls to dvc exp show then hung indefinitely. The exp ref tree looked like this:

├── 93
│   └── 3131de80b564742afb0795f683af6b955b327b
│       ├── exp-47c59
│       └── exp-d028d
├── exec
│   └── EXEC_CHECKPOINT
└── stash

and .dvc/tmp/exps looked like this:

├── run
│   ├── 84d95dd90354e1113745e031c91b1b472d097040
│   │   └── 84d95dd90354e1113745e031c91b1b472d097040.run
│   ├── abd5144a227994494fc6bd8e08a7ac3eda20f686
│   │   └── abd5144a227994494fc6bd8e08a7ac3eda20f686.run
│   ├── c55ad5478175c4a4cc972720c99a0fce02e0467e
│   │   └── c55ad5478175c4a4cc972720c99a0fce02e0467e.run
│   ├── d27f02d47c5b0530cbe515fd9abc3d0cf2347df9
│   │   └── d27f02d47c5b0530cbe515fd9abc3d0cf2347df9.run
│   └── e4d733134154805708e61661df783a58ab89c5bf
│       └── e4d733134154805708e61661df783a58ab89c5bf.run
├── tmp2t9f3aqn
│   ├── ...whole repo
├── tmp3jm778wy
│   ├── ...whole repo
├── tmppxj8hjr9
│   ├── ...whole repo
├── tmptwluxju2
│   ├── ...whole repo
├── tmpux4x5eii
│   ├── ...whole repo
├── tmpv_sqh75p
│   ├── ...whole repo

After running dvc exp gc -T, dvc exp show began working as expected again.

I tried to reproduce the issue by queueing the following in example-dvc-experiments but the process completed as expected after 44m 36s:

dvc exp run --queue -S params.yaml:train.epochs=20 -S params.yaml:model.conv_units=16
dvc exp run --queue -S params.yaml:train.epochs=20 -S params.yaml:model.conv_units=16
dvc exp run --queue -S params.yaml:train.epochs=20 -S params.yaml:model.conv_units=16
dvc exp run --queue -S params.yaml:train.epochs=20 -S params.yaml:model.conv_units=16

This leads me to believe this has something to do with checkpoints being present in the table.

Note: At some point in the early stages of fighting this bug, I got the repo into a state where dvc exp show --show-json was triggering the file watcher that causes dvc exp show --show-json to run, by touching .git/refs/exps/exec/EXEC_CHECKPOINT and another ref, and my computer nearly melted.

mattseddon avatar Dec 17 '21 04:12 mattseddon

I tried to recreate with dvc-checkpoints-mnist.

I set the epochs to 36 and queued 4 experiments twice. On the 7th experiment run, the experiment finished and returned ERROR: too many open files, please visit <https://error.dvc.org/many-files> to see how to handle this problem. Trying to run dvc exp run --run-all gives the same error every time now.
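For anyone hitting the same error, a hedged sketch of the usual first check (the limit value is only an example):

ulimit -n        # show the current open-file (file descriptor) soft limit
ulimit -n 4096   # raise it for the current shell session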

mattseddon avatar Dec 17 '21 05:12 mattseddon

I am now seeing a similar issue where exp show hung for a long time, here are the steps that I went through

  1. my repo is now in this weird state where exp show is now "touching" vscode-dvc/.git/refs/exps/93/3131de80b564742afb0795f683af6b955b327b/exp-a593b & vscode-dvc/.git/refs/exps/exec/EXEC_CHECKPOINT every time it runs (even after running dvc gc -T) this is causing an infinite loop of exp show commands

Note: At some point in the early stages of fighting against this bug I got the repo in a state where dvc exp show --show-json was triggering the file watcher that causes dvc exp show --show-json to run by touching .git/refs/exps/exec/EXEC_CHECKPOINT and another ref and my computer nearly melted.

So exp show will touch git refs in this scenario, because it works by git fetching refs from each of the temporary workspaces (where your 4 experiments are running) into the main git workspace.
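A small sketch, based on the explanation above rather than on extension behaviour, of inspecting those refs from the main workspace:

git for-each-ref refs/exps   # list the experiment refs that have been fetched
ls .git/refs/exps/exec       # EXEC_CHECKPOINT sits under this directory (see the ref tree above)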

What (directories/files) does the file watcher currently trigger exp show on? Is it possible to exclude paths from being watched? It seems like we may need to have a discussion on what exactly needs to be monitored (for the repo to be considered "changed").

pmrowla avatar Dec 18 '21 02:12 pmrowla

@pmrowla we can (should) definitely discuss that. Any ideas on why exp show is hanging?

mattseddon avatar Dec 18 '21 03:12 mattseddon

Some more info from running queued experiments.

Queueing a set of 4 experiments with 20 epochs that don't collide with the existing experiment in the workspace (iterative/dvc#7126) led to the following behaviour:

dvc exp show --show-json hung twice during --run-all: once at the beginning of the 3rd experiment and once at the beginning of the 4th, with the following total run times:

dvc exp show --show-json - COMPLETED (236861ms)
dvc exp show --show-json - COMPLETED (307770ms)

For reference, exp show usually takes ~2 seconds in this repo.

Before the 1st hang, the following appeared in the log:

dvc exp show --show-json - FAILED with code 255 (6525ms)
ERROR: unexpected error - b'2f5b6e9b70e2adeb46ca9c9bc58215384ad04d8e'

And not long before the 2nd hang, this message appeared in the log:

dvc exp show --show-json - FAILED with code 255 (5362ms)
ERROR: unexpected error - b'b84b556bcf58bf75dddafdcdd43b87774e11fa7e'

After all the experiments seemed to have completed, the workspace was in this state:

https://user-images.githubusercontent.com/37993418/146692288-18aa33ce-6581-4ae3-997f-e19adae11086.mov

The yellow point on the LHS of the plot (two points in the table) represents the 3rd experiment run. After a further 10 minutes of waiting the "broken" state appeared to resolve and all of the remaining refs made their way into the workspace:

It should also be noted that experiments 1 and 2 only showed data for the first 18 epochs until shortly before I started recording the screen.

https://user-images.githubusercontent.com/37993418/146692686-fa889b81-7238-46d5-884b-7d2d0d78d718.mov

The total run time for the 4 experiments was -> dvc exp run --run-all - COMPLETED (1774709ms).

From reading https://github.com/iterative/dvc/issues/7126#issuecomment-992157559, I can see that --queue/--run-all are mostly experimental and not a long-term solution to this problem. I think we will need to make a call on whether or not to include queueing/running all experiments in the initial release, as the behaviour can be very confusing. We can discuss this in the various meetings this week.

mattseddon avatar Dec 19 '21 22:12 mattseddon

@pmrowla we can (should) definitely discuss that. Any ideas on why exp show is hanging?

I'm not sure what makes it hang yet, but it wasn't really designed to be spammed in this way (especially if there's an infinite loop of more exp show calls being made).

pmrowla avatar Dec 21 '21 10:12 pmrowla

For one reason or another, this issue no longer occurs for me. I haven't been consistently checking, so it may have been resolved a while ago.

https://user-images.githubusercontent.com/9111807/158500893-b8d07217-532a-4e8a-bd97-33f89b1b94f6.mp4

This issue could likely be closed, though I'm not sure if @mattseddon is still experiencing his similar issue.

rogermparent avatar Mar 16 '22 01:03 rogermparent

@rogermparent see https://github.com/iterative/vscode-dvc/issues/828#issuecomment-996437072 I could only recreate the issue using the queue. We might want to update the title of the issue but we should leave it open for now.

mattseddon avatar Mar 16 '22 04:03 mattseddon

We might want to update the title of the issue but we should leave it open for now.

I've updated the title and OP to point to your comment.

rogermparent avatar Mar 16 '22 16:03 rogermparent

Could be mitigated by https://github.com/iterative/dvc/pull/7574 combined with https://github.com/iterative/vscode-dvc/pull/1598

mattseddon avatar Apr 21 '22 07:04 mattseddon

@pmrowla @karajan1001 @mattseddon is it resolved now? If not, does it still hang, or is it about something else?

shcheklein avatar Nov 05 '22 20:11 shcheklein

I haven't noticed any hanging caused by exp --queue recently. For exp run --run-all, fetching might cause exp show to freeze for a few seconds during result collection.

karajan1001 avatar Nov 08 '22 02:11 karajan1001

Seems to be fixed. We can reopen if it is still an issue.

mattseddon avatar Nov 10 '22 18:11 mattseddon