dvc queue logs: "No output logs found" for failed experiment thus unable to debug failure
Bug Report
Description
I am using DVC to run a grid search of about 300 experiments. Halfway through the queue, one of the experiments fails and DVC stops running the rest of the queue. I would like to debug why this experiment failed, but I am unable to view the logs:
> dvc queue logs 650da72
ERROR: No output logs found for experiment '650da72'
Or apply the experiment config to run it again:
> dvc exp apply 650da72
ERROR: '650da72a154137e2102249c5d7998075fe3c6ff9' does not appear to be an experiment commit.
(I suspect this is the same problem as https://github.com/iterative/dvc/issues/8182.)
Both dvc queue logs and dvc exp apply work for experiments marked as Success, Running, or Queued; it is only the failed one that doesn't work.
Is there a way to find out which experiment failed, to debug it?
Reproduce
- Queue about 300 experiments by calling dvc exp run --queue -S param1=blah ... from a Python script (a minimal sketch of such a script is shown after these steps).
- Run dvc queue start.
- Several hours later, run dvc queue status:
Task Name Created Status
39bde48 Aug 25, 2022 Running
5b00534 Aug 25, 2022 Queued
...
cca1543 Aug 25, 2022 Queued
650da72 Aug 25, 2022 Failed
9a02fb0 exp-342ce Aug 25, 2022 Success
...
86d21aa exp-3f8bd Aug 25, 2022 Success
Worker status: 1 active, 0 idle
Note that 39bde48 does not appear to be running, as CPU usage is zero.
- Run dvc queue logs cca1543
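For reference, a minimal sketch of the kind of Python script used to queue the grid; the parameter names and values here are placeholders, not the actual search space:

import itertools
import subprocess

# Hypothetical parameter grid; the real search queued roughly 300 combinations.
grid = {
    "param1": ["blah", "foo", "bar"],
    "param2": [0.1, 0.2, 0.3],
}

for values in itertools.product(*grid.values()):
    cmd = ["dvc", "exp", "run", "--queue"]
    for name, value in zip(grid, values):
        cmd += ["-S", f"{name}={value}"]
    # Queue one experiment per parameter combination without running it yet;
    # `dvc queue start` later executes the whole queue.
    subprocess.run(cmd, check=True)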
Expected
Some logs indicating why the experiment failed.
Environment information
DVC version: 2.18.0 (conda)
---------------------------------
Platform: Python 3.9.12 on Linux-5.4.0-1086-gcp-x86_64-with-glibc2.27
Supports:
gs (gcsfs = 2022.7.1),
webhdfs (fsspec = 2022.7.1),
http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda1
Caches: local
Remotes: gs
Workspace directory: ext4 on /dev/sda1
Repo: dvc, git
Additional Information (if any): Here is the verbose output:
> dvc queue logs -v 650da72
2022-08-26 08:51:50,011 ERROR: No output logs found for experiment '650da72'
------------------------------------------------------------
Traceback (most recent call last):
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc_task/proc/manager.py", line 51, in __getitem__
return ProcessInfo.load(info_path)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc_task/proc/process.py", line 39, in load
with open(filename, "r", encoding="utf-8") as fobj:
FileNotFoundError: [Errno 2] No such file or directory: '/home/oscar/socpercent_220708_cornbelt_benchmark/.dvc/tmp/exps/run/650da72a154137e2102249c5d7998075fe3c6ff9/650da72a154137e2102249c5d7998075fe3c6ff9.json'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/queue/celery.py", line 347, in logs
proc_info = self.proc[queue_entry.stash_rev]
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc_task/proc/manager.py", line 53, in __getitem__
raise KeyError from exc
KeyError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
ret = cmd.do_run()
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
return self.run()
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/commands/queue/logs.py", line 14, in run
self.repo.experiments.celery_queue.logs(
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/queue/celery.py", line 349, in logs
raise DvcException(f"No output logs found for experiment '{rev}'")
dvc.exceptions.DvcException: No output logs found for experiment '650da72'
------------------------------------------------------------
2022-08-26 08:51:50,013 DEBUG: Analytics is disabled.
dvc queue logs works for failed experiments, at least with 2.18.1 and higher.
Here are the logs from .dvc/tmp/exps/celery/dvc-exp-worker-1.out: dvc-exp-worker-1.out.txt
The start of the stack trace is:
[2022-08-25 22:07:00,267: ERROR/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[1a24d875-b564-4027-9aae-7569622614cc] raised unexpected: TypeError("open() missing required argument 'flags' (pos 2)")
Traceback (most recent call last):
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/executor/local.py", line 135, in from_stash_entry
executor = cls._from_stash_entry(
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 248, in _from_stash_entry
executor.init_git(repo.scm, branch=entry.branch)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/executor/local.py", line 78, in init_git
push_refspec(scm, self.git_url, refspec, refspec)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/utils.py", line 138, in push_refspec
results = scm.push_refspecs(
File "/home/oscar/miniconda3/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
result = func(*args, **kwargs)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 564, in push_refspecs
client.send_pack(
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/client.py", line 1438, in send_pack
*generate_pack_data(have, want, ofs_delta=True)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 324, in generate_pack_data
self.generate_pack_contents(have, want, shallow, progress)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 307, in generate_pack_contents
missing = self.find_missing_objects(have, want, shallow, progress)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 271, in find_missing_objects
finder = MissingObjectFinder(
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 1295, in __init__
missing_commits, common_commits = object_store._collect_ancestors(
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 374, in _collect_ancestors
cmt = self[e]
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 144, in __getitem__
type_num, uncomp = self.get_raw(sha)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 576, in get_raw
return pack.get_raw(sha)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/pack.py", line 2143, in get_raw
offset = self.index.object_index(sha1)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/pack.py", line 2064, in index
self._idx = self._idx_load()
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/pack.py", line 2020, in <lambda>
self._idx_load = lambda: load_pack_index(self._idx_path)
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/pack.py", line 280, in load_pack_index
with GitFile(path, "rb") as f:
File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/file.py", line 94, in GitFile
return io.open(filename, mode, bufsize)
OSError: [Errno 24] Too many open files: '/home/oscar/socpercent_220708_cornbelt_benchmark/.git/objects/pack/pack-e6cba4e15a30783eca405f9d0e4cb95575750dce.idx'
During handling of the above exception, another exception occurred:
...
This probably has to do with ulimit.
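If that is the case, one way to check and raise the per-process open-file limit before starting the queue is a sketch like the one below (using Python's resource module here is just an illustration; raising the shell's ulimit -n before dvc queue start should have the same effect):

import resource

# Inspect the current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit (an unprivileged process
# cannot go above the hard limit).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))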
@pmrowla Do you think this was resolved by https://github.com/iterative/dvc/pull/9089?
Yes, this can be closed.