dvc icon indicating copy to clipboard operation
dvc copied to clipboard

dvc queue logs: "No output logs found" for failed experiment thus unable to debug failure

Open oscarkey opened this issue 2 years ago • 3 comments

Bug Report

Description

I am using DVC to run a grid search of about 300 experiments. Half way through the queue one of the experiments fails and DVC stops running the rest of the queue. I would like to debug why this experiment has failed. I am unable to view the logs:

> dvc queue logs 650da72
ERROR: No output logs found for experiment '650da72'

Or apply the experiment config to run it again:

> dvc exp apply 650da72
ERROR: '650da72a154137e2102249c5d7998075fe3c6ff9' does not appear to be an experiment commit.

(I guess this problem is https://github.com/iterative/dvc/issues/8182) Both dvc queue logs and dvc exp apply work for experiments marked as Success, Running, or Queued, it's just the failed one that doesn't work.

Is there a way to find out which experiment failed, to debug it?

Reproduce

  1. Queue about 300 experiments by calling dvc exp run --queue -S param1=blah ... from a Python script
  2. Run dvc queue start
  3. Several hours later, run dvc queue status
Task     Name       Created       Status
39bde48             Aug 25, 2022  Running
5b00534             Aug 25, 2022  Queued
...
cca1543             Aug 25, 2022  Queued
650da72             Aug 25, 2022  Failed
9a02fb0  exp-342ce  Aug 25, 2022  Success
...
86d21aa  exp-3f8bd  Aug 25, 2022  Success

Worker status: 1 active, 0 idle

Note that 39bde48 does not appear to be running as CPU usage is zero.

  1. Run dvc queue logs cca1543

Expected

Some logs indicating why the experiment failed.

Environment information

DVC version: 2.18.0 (conda)
---------------------------------
Platform: Python 3.9.12 on Linux-5.4.0-1086-gcp-x86_64-with-glibc2.27
Supports:
        gs (gcsfs = 2022.7.1),
        webhdfs (fsspec = 2022.7.1),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda1
Caches: local
Remotes: gs
Workspace directory: ext4 on /dev/sda1
Repo: dvc, git

Additional Information (if any): Here is the verbose output:

> dvc queue logs -v 650da72
2022-08-26 08:51:50,011 ERROR: No output logs found for experiment '650da72'
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc_task/proc/manager.py", line 51, in __getitem__
    return ProcessInfo.load(info_path)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc_task/proc/process.py", line 39, in load
    with open(filename, "r", encoding="utf-8") as fobj:
FileNotFoundError: [Errno 2] No such file or directory: '/home/oscar/socpercent_220708_cornbelt_benchmark/.dvc/tmp/exps/run/650da72a154137e2102249c5d7998075fe3c6ff9/650da72a154137e2102249c5d7998075fe3c6ff9.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/queue/celery.py", line 347, in logs
    proc_info = self.proc[queue_entry.stash_rev]
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc_task/proc/manager.py", line 53, in __getitem__
    raise KeyError from exc
KeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/commands/queue/logs.py", line 14, in run
    self.repo.experiments.celery_queue.logs(
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/queue/celery.py", line 349, in logs
    raise DvcException(f"No output logs found for experiment '{rev}'")
dvc.exceptions.DvcException: No output logs found for experiment '650da72'
------------------------------------------------------------
2022-08-26 08:51:50,013 DEBUG: Analytics is disabled.

oscarkey avatar Aug 26 '22 08:08 oscarkey

dvc queue logs works for failed experiments. At least with 2.18.1 and higher.

macio232 avatar Aug 26 '22 09:08 macio232

Here are the logs from .dvc/tmp/exps/celery/dvc-exp-worker-1.out: dvc-exp-worker-1.out.txt The start of the stack trace is

[2022-08-25 22:07:00,267: ERROR/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[1a24d875-b564-4027-9aae-7569622614cc] raised unexpected: TypeError("open() missing required argument 'flags' (pos 2)")
Traceback (most recent call last):
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/executor/local.py", line 135, in from_stash_entry
    executor = cls._from_stash_entry(
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 248, in _from_stash_entry
    executor.init_git(repo.scm, branch=entry.branch)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/executor/local.py", line 78, in init_git
    push_refspec(scm, self.git_url, refspec, refspec)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dvc/repo/experiments/utils.py", line 138, in push_refspec
    results = scm.push_refspecs(
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 564, in push_refspecs
    client.send_pack(
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/client.py", line 1438, in send_pack
    *generate_pack_data(have, want, ofs_delta=True)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 324, in generate_pack_data
    self.generate_pack_contents(have, want, shallow, progress)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 307, in generate_pack_contents
    missing = self.find_missing_objects(have, want, shallow, progress)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 271, in find_missing_objects
    finder = MissingObjectFinder(
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 1295, in __init__
    missing_commits, common_commits = object_store._collect_ancestors(
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 374, in _collect_ancestors
    cmt = self[e]
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 144, in __getitem__
    type_num, uncomp = self.get_raw(sha)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/object_store.py", line 576, in get_raw
    return pack.get_raw(sha)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/pack.py", line 2143, in get_raw
    offset = self.index.object_index(sha1)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/pack.py", line 2064, in index
    self._idx = self._idx_load()
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/pack.py", line 2020, in <lambda>
    self._idx_load = lambda: load_pack_index(self._idx_path)
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/pack.py", line 280, in load_pack_index
    with GitFile(path, "rb") as f:
  File "/home/oscar/miniconda3/lib/python3.9/site-packages/dulwich/file.py", line 94, in GitFile
    return io.open(filename, mode, bufsize)
OSError: [Errno 24] Too many open files: '/home/oscar/socpercent_220708_cornbelt_benchmark/.git/objects/pack/pack-e6cba4e15a30783eca405f9d0e4cb95575750dce.idx'

During handling of the above exception, another exception occurred:
...

oscarkey avatar Aug 26 '22 12:08 oscarkey

This has probably to do with ulimit

macio232 avatar Aug 26 '22 12:08 macio232

@pmrowla Do you think this was resolved by https://github.com/iterative/dvc/pull/9089?

dberenbaum avatar Mar 02 '23 20:03 dberenbaum

Yes, this can be closed

pmrowla avatar Mar 03 '23 05:03 pmrowla