
`dvc exp run --run-all` results in `ERROR: unexpected error`

Open ivyleavedtoadflax opened this issue 3 years ago • 3 comments

Bug Report

Description

After queuing up a number of experiments that I can see with dvc exp show:

Experiment Created State eval_loss ...
workspace - - 0.051839 ...
longdocs2 May 16, 2022 - 0.051839 ...
├── a9a2a52 May 16, 2022 Queued - ...
├── a9362c3 May 16, 2022 Queued - ...
├── a412093 May 16, 2022 Queued - ...
├── ceebf27 May 16, 2022 Queued - ...
├── e09f285 May 16, 2022 Queued - ...
├── f58aa90 May 16, 2022 Queued - ...
├── 1be2ffe May 16, 2022 Queued - ...
├── b62c559 May 16, 2022 Queued - ...
├── 7aa60b9 May 16, 2022 Queued - ...
├── 97fb27f May 16, 2022 Queued - ...
├── c1f5135 May 16, 2022 Queued - ...
├── 6fa4dda May 16, 2022 Queued - ...
├── a74abe4 May 16, 2022 Queued - ...
├── 949343f May 16, 2022 Queued - ...
├── 0b49a7b May 16, 2022 Queued - ...
├── cfe8b2c May 16, 2022 Queued - ...
├── 2530894 May 16, 2022 Queued - ...
├── fd04249 May 16, 2022 Queued - ...
├── 4c5a546 May 16, 2022 Queued - ...
├── 1aeb3f1 May 16, 2022 Queued - ...
├── 294699c May 16, 2022 Queued - ...
├── 831a18b May 16, 2022 Queued - ...
├── ab811df May 16, 2022 Queued - ...
├── 97fd1b5 May 16, 2022 Queued - ...
├── b1a714a May 16, 2022 Queued - ...
├── c7b2795 May 16, 2022 Queued - ...
├── ee90f65 May 16, 2022 Queued - ...
├── 9f9584b May 16, 2022 Queued - ...
├── 951c4bb May 16, 2022 Queued - ...
├── f545d49 May 16, 2022 Queued - ...
└── 910dcc0 May 16, 2022 Queued - ...

I get the following when I run dvc exp run --run-all:

$ dvc exp run --run-all
ERROR: unexpected error                                               

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Reproduce

It's difficult to know precisely how to reproduce this: sometimes it works and sometimes it doesn't, and I could not reproduce it on a toy example. In principle:

  1. dvc init
  2. dvc exp run --queue -S
  3. repeat multiple times
  4. dvc exp run --run-all
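
The steps above could be scripted roughly as follows. This is only a sketch: it builds the command lines and prints them instead of running them, and the parameter name `train.lr` and its values are hypothetical placeholders, since the report does not say which parameter was passed to `-S`.

```python
import shlex


def build_repro_commands(param_name, values):
    """Return the argv lists for the reproduce steps, without running them."""
    commands = [["dvc", "init"]]
    for value in values:
        # Steps 2 and 3: queue one experiment per parameter value.
        commands.append(
            ["dvc", "exp", "run", "--queue", "-S", f"{param_name}={value}"]
        )
    # Step 4: run everything that was queued.
    commands.append(["dvc", "exp", "run", "--run-all"])
    return commands


if __name__ == "__main__":
    # "train.lr" and the values below are hypothetical, not from the report.
    for argv in build_repro_commands("train.lr", [0.1, 0.01, 0.001]):
        print(shlex.join(argv))
```

Each printed line can be pasted into a shell inside a Git repository, or each argv list can be passed to `subprocess.run` to execute the sequence directly.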

Expected

I expected dvc exp to run my experiments, or at least offer a useful error message.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.15.0-1005-aws-x86_64-with-glibc2.35
Supports:
        hdfs (fsspec = 2022.3.0, pyarrow = 8.0.0),
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.3.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git

Additional Information (if any): Output of dvc exp run --run-all --verbose

2022-05-17 04:19:46,090 DEBUG: Reproducing experiment revs 'a9a2a52, a9362c3, a412093, ceebf27, e09f285, f58aa90, 1be2ffe, b62c559, 7aa60b9, 97fb27f, c1f5135, 6fa4dda, a74abe4, 949343f, 0b49a7b, cfe8b2c, 2530894, fd04249, 4c5a546, 1aeb3f1, 294699c, 831a18b, ab811df, 97fd1b5, b1a714a, c7b2795, ee90f65, 9f9584b, 951c4bb, f545d49, 910dcc0'
2022-05-17 04:19:46,234 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpio0j1h9t/.dvc/config.local'
2022-05-17 04:19:46,234 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpio0j1h9t'
2022-05-17 04:19:46,347 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmp522qrkpe/.dvc/config.local'
2022-05-17 04:19:46,347 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmp522qrkpe'
2022-05-17 04:19:46,460 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpsqxqe4a1/.dvc/config.local'
2022-05-17 04:19:46,461 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpsqxqe4a1'
2022-05-17 04:19:46,570 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpfcqw2zd4/.dvc/config.local'
2022-05-17 04:19:46,570 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpfcqw2zd4'
2022-05-17 04:19:46,681 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmprg6sp9y9/.dvc/config.local'
2022-05-17 04:19:46,682 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmprg6sp9y9'
2022-05-17 04:19:46,795 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmptikvz81f/.dvc/config.local'
2022-05-17 04:19:46,795 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmptikvz81f'
2022-05-17 04:19:46,907 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpgwal6ofe/.dvc/config.local'
2022-05-17 04:19:46,907 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpgwal6ofe'
2022-05-17 04:19:47,021 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpq2mhffva/.dvc/config.local'
2022-05-17 04:19:47,021 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpq2mhffva'
2022-05-17 04:19:47,133 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpcaqst3h0/.dvc/config.local'
2022-05-17 04:19:47,133 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpcaqst3h0'
2022-05-17 04:19:47,245 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpyqu6eet0/.dvc/config.local'
2022-05-17 04:19:47,245 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpyqu6eet0'
2022-05-17 04:19:47,361 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpt_gclmlq/.dvc/config.local'
2022-05-17 04:19:47,362 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpt_gclmlq'
2022-05-17 04:19:47,477 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpv5idmb5e/.dvc/config.local'
2022-05-17 04:19:47,477 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpv5idmb5e'
2022-05-17 04:19:47,587 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpljs_i66o/.dvc/config.local'
2022-05-17 04:19:47,587 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpljs_i66o'
2022-05-17 04:19:47,699 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmprogu65u0/.dvc/config.local'
2022-05-17 04:19:47,699 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmprogu65u0'
2022-05-17 04:19:47,809 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpmpgofdf4/.dvc/config.local'
2022-05-17 04:19:47,809 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpmpgofdf4'
2022-05-17 04:19:47,920 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmppftuyscl/.dvc/config.local'
2022-05-17 04:19:47,921 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmppftuyscl'
2022-05-17 04:19:48,034 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpdipuv88o/.dvc/config.local'
2022-05-17 04:19:48,034 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpdipuv88o'
2022-05-17 04:19:48,144 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmp2tqp8tnt/.dvc/config.local'
2022-05-17 04:19:48,144 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmp2tqp8tnt'
2022-05-17 04:19:48,255 DEBUG: Writing experiments local config '/home/matt/project/.dvc/tmp/exps/tmpea2ejj86/.dvc/config.local'
2022-05-17 04:19:48,255 DEBUG: Init temp dir executor in '/home/matt/project/.dvc/tmp/exps/tmpea2ejj86'
2022-05-17 04:19:48,559 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 95] Operation not supported
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/cli/__init__.py", line 90, in main
    ret = cmd.do_run()
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/commands/experiments/run.py", line 32, in run
    results = self.repo.experiments.run(
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 825, in run
    return run(self.repo, *args, **kwargs)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/experiments/run.py", line 28, in run
    return repo.experiments.reproduce_queued(jobs=jobs)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 457, in reproduce_queued
    results = self._reproduce_revs(**kwargs)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 53, in wrapper
    return f(exp, *args, **kwargs)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 635, in _reproduce_revs
    manager = manager_cls.from_stash_entries(
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 119, in from_stash_entries
    manager._enqueue_stash_entries(scm, repo, to_run, **kwargs)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 147, in _enqueue_stash_entries
    self.enqueue(stash_rev, executor)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 70, in enqueue
    assert rev not in self
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
    func(from_path, to_path)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/fs/base.py", line 263, in reflink
    return self.fs.reflink(from_info, to_info)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/fs/local.py", line 156, in reflink
    return System.reflink(path1, path2)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
    System._reflink_linux(source, link_name)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
    fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 95] Operation not supported

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "/home/matt/project/.env/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2022-05-17 04:19:48,560 DEBUG: Removing '/home/matt/.V4NdZ3uXiSsszXYSj6WPvF.tmp'
2022-05-17 04:19:48,560 DEBUG: Removing '/home/matt/.V4NdZ3uXiSsszXYSj6WPvF.tmp'
2022-05-17 04:19:48,561 DEBUG: Removing '/home/matt/.V4NdZ3uXiSsszXYSj6WPvF.tmp'
2022-05-17 04:19:48,561 DEBUG: Removing '/home/matt/project/.dvc/cache/.KZ4Zu7TEA7FBRtSDWNpDgQ.tmp'
2022-05-17 04:19:48,564 DEBUG: Version info for developers:
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.15.0-1005-aws-x86_64-with-glibc2.35
Supports:
	hdfs (fsspec = 2022.3.0, pyarrow = 8.0.0),
	webhdfs (fsspec = 2022.3.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2022.3.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-05-17 04:19:48,566 DEBUG: Analytics is enabled.
2022-05-17 04:19:48,622 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp2cpzut38']'
2022-05-17 04:19:48,624 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp2cpzut38']'
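
For readers of the verbose output above: it contains an AssertionError raised by `assert rev not in self`, followed by three OSError tracebacks about link types. The sketch below is plain Python, not DVC code, and the names `enqueue` and `describe` are hypothetical. It illustrates two standard behaviours that help when reading such output: a bare `assert` carries no message, which may be why the CLI has nothing more specific to print, and the two separator sentences in a traceback correspond to implicit (`__context__`) and explicit (`__cause__`) exception chaining.

```python
def enqueue(queue, rev):
    # Same shape as the failing check in the traceback: a bare assert
    # with no message attached.
    assert rev not in queue
    queue.add(rev)


def describe(exc):
    """Say how an exception is linked to the one raised before it."""
    if exc.__cause__ is not None:
        # Raised with `raise ... from err`:
        # "The above exception was the direct cause of the following exception"
        return "direct cause"
    if exc.__context__ is not None:
        # Raised while another exception was being handled:
        # "During handling of the above exception, another exception occurred"
        return "during handling"
    return "standalone"


queue = set()
enqueue(queue, "a9a2a52")
try:
    enqueue(queue, "a9a2a52")  # adding the same rev twice trips the assert
except AssertionError as err:
    first = err
    message = str(err)  # a bare assert leaves the message empty

try:
    try:
        raise OSError(95, "Operation not supported")
    except OSError as low:
        raise OSError(95, "'reflink' is not supported") from low
except OSError as err:
    chained = err

print(repr(message), describe(first), describe(chained))
```

Under this reading, the assertion in `enqueue` is worth investigating separately from the reflink messages, but the report itself does not confirm which of the two caused the failure.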

ivyleavedtoadflax avatar May 17 '22 04:05 ivyleavedtoadflax

This issue will likely go away when we move to the new queueing backend which should be relatively soon.

pmrowla avatar May 17 '22 04:05 pmrowla

For reference, this can be worked around by removing all experiments, re-queuing a smaller number, and running those. It's also worth noting that this project involves quite a lot of data:

$ du -h -d 1 .dvc
231G    .dvc/cache
17G     .dvc/tmp
247G    .dvc

$ du -h -d 1 data
16G     data/raw
28K     data/processed
16G     data
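
The workaround described in this comment (queue fewer experiments at a time) could be planned along these lines. This is a sketch under assumptions: the batch size, the parameter name `train.lr` and the helper names are hypothetical, and the code only builds the command lists, it does not call DVC.

```python
def batches(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_batched_runs(param_name, values, batch_size):
    """Queue a small batch, run it, then move on to the next batch."""
    plan = []
    for batch in batches(list(values), batch_size):
        for value in batch:
            plan.append(
                ["dvc", "exp", "run", "--queue", "-S", f"{param_name}={value}"]
            )
        # Run the queued batch before queuing any more experiments.
        plan.append(["dvc", "exp", "run", "--run-all"])
    return plan


if __name__ == "__main__":
    # "train.lr", the values and the batch size are hypothetical.
    for argv in plan_batched_runs("train.lr", [0.1, 0.03, 0.01, 0.003, 0.001], 2):
        print(" ".join(argv))
```

Keeping each batch small mirrors what worked in this thread; the thread does not establish a safe upper limit, so the batch size would need to be found by trial.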

ivyleavedtoadflax avatar May 17 '22 04:05 ivyleavedtoadflax

@ivyleavedtoadflax could you check please if this is resolved now?

shcheklein avatar Aug 06 '22 15:08 shcheklein

Hey, sorry, I don't have access to this pipeline anymore, so I cannot retest this!

ivyleavedtoadflax avatar Oct 07 '22 16:10 ivyleavedtoadflax

closing as stale

pmrowla avatar Oct 09 '22 02:10 pmrowla