`dvc exp run --queue` gives unclear error without committed pipeline files
Bug Report
`dvc exp run --queue`: fails with "No such file or directory" on a cache path similar to `.dvc/tmp/exps`
Description
- It appears that `dvc exp run --queue` only works on DVC pipelines that have been previously committed to git
- The error from this is not clear
When running a queued experiment with `dvc exp run --queue`, the job is queued and can be started with `dvc queue start`. However, it will fail with an error similar to `ERROR: unexpected error - [Errno 2] No such file or directory: '[path to repo]/.dvc/tmp/exps/tmpabc123/...`.
The same experiment can be successfully run with `dvc repro` and `dvc exp run`. It appears to work once the pipeline is committed with git, which suggests that the missing commit is either the cause of the issue or related to it, but the error message makes no mention of this.
Also, once the pipeline is committed and a new uncommitted change is made, it is unclear which version of the pipeline is being run: the committed version, or the "dirty" version in the current directory.
Other minor issues
These can be separate issues if required.
- print statements are not shown in `--follow` unless explicitly flushed, though this may just be unavoidable celery behaviour
- when running `dvc queue logs [task]` on a task that requires some slow dvc checkout startup, it gives a "no logs available" message, but using `--follow` it gives `ERROR: unexpected error - : [Errno 2] No such file or directory: '/[path to repo]/.dvc/tmp/exps/run/[uuid]/[uuid].json`. The same command later succeeds, presumably once the job has actually started.
- the UTC timestamps shown by `dvc queue status` move to `MM DD, YYYY` format on the next day, which hides helpful time info, especially if you don't work in UTC (i.e. this can happen during the day)
Reproduce
1. Create a new repo: `mkdir /tmp/example; cd /tmp/example; git init; dvc init;`
2. Create a pipeline: `mkdir pipeline` and copy the following as files:
main.py
import time

start_time = time.time()
while time.time() - start_time < 10:
    print("Running at", time.time())
    time.sleep(5)
dvc.yaml
stages:
  main:
    cmd: python3 main.py
3. Run `dvc repro`: pipeline runs successfully
4. Run `dvc exp run`: experiment cannot be run without an existing git commit (not really a problem in most repos, plus it has a good error message)
5. Run `git commit -m "Setup repo"` to commit the init files, but not the pipeline files, so that there is at least one commit in the repo
6. Run `dvc exp run`: experiment runs successfully
7. Run `dvc exp run --queue`: command runs successfully
8. Run `dvc queue start`: command runs successfully
9. Run `dvc queue logs [task name]`: shows "ERROR: unexpected error - [Errno 2] No such file or directory"
10. Run `dvc queue status`: task shown as "Failed"
11. Commit pipeline files
12. Re-run steps 7 and 8
13. Run `dvc queue logs [task name]`: no error, task running as expected
14. Run `dvc queue status`: task eventually shown as success
15. Bonus: Run `dvc queue logs [task name] --follow`: note that print statements are not shown until the end of the task, unless `sys.stdout.flush()` is called
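For the bonus step, here is a minimal sketch of a `main.py` variant that flushes every print, so each line should reach `dvc queue logs [task name] --follow` as it is produced. The `run` function and its `duration` and `interval` parameters are illustrative names that are not in the original script, and the sketch assumes the delay comes from ordinary stdout buffering.

```python
import time


def run(duration=10.0, interval=5.0):
    """Print a timestamp every `interval` seconds for roughly `duration` seconds.

    Returns the number of lines printed (illustrative, handy for checking).
    """
    start_time = time.time()
    count = 0
    while time.time() - start_time < duration:
        # flush=True writes the line out immediately instead of leaving it
        # in the stdout buffer until the process exits.
        print("Running at", time.time(), flush=True)
        count += 1
        time.sleep(interval)
    return count
```

Calling `run()` with the defaults reproduces the original 10-second loop. An alternative that leaves the script untouched is to start Python unbuffered, for example `cmd: python3 -u main.py` in dvc.yaml.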
Expected
Either `dvc exp run --queue` should work without first committing the pipeline, or a clear error message should be shown indicating it needs to be committed first.
If a committed pipeline is required, it should be clear whether the committed version or the current "dirty" version of the pipeline is being run.
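Until the error message improves, a pre-flight check on the user's side can flag the situation before queueing. The sketch below is not DVC code: `parse_porcelain`, `uncommitted_paths` and `queue_warning` are hypothetical helper names, and it assumes the pipeline files are `dvc.yaml` and `main.py` as in the reproduction steps. It relies only on `git status --porcelain`, where each line starts with a two-character status code followed by a space and the path.

```python
import subprocess


def parse_porcelain(output):
    """Map each path in `git status --porcelain` (v1) output to its status code.

    Simplification: renames ("old -> new") and quoted paths are not unpacked.
    """
    statuses = {}
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        # Do not strip the line: a leading space is part of codes like " M".
        code, path = line[:2], line[3:]
        statuses[path] = code
    return statuses


def uncommitted_paths(paths, repo_dir="."):
    """Return {path: status} for the given files that are untracked or modified."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--untracked-files=all", "--", *paths],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_porcelain(result.stdout)


def queue_warning(statuses):
    """Build a warning message, or return None when nothing is uncommitted."""
    if not statuses:
        return None
    listing = ", ".join(
        f"{path} [{code.strip()}]" for path, code in sorted(statuses.items())
    )
    return (
        "Pipeline files are not committed: " + listing
        + ". Queued experiments may fail or use a different version."
    )
```

For example, `print(queue_warning(uncommitted_paths(["dvc.yaml", "main.py"])))` could be run in the repo before `dvc exp run --queue`. A status of `??` means the file is untracked and `M` means it has uncommitted modifications.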
Environment information
DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 3.15.1
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.4.0
    scmrepo = 3.3.6
Supports:
    http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.6.1, boto3 = 1.34.131)
Config:
    Global: ~/.config/dvc
    System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sdc
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/c8f65ca41ec45168d44ff0121e3c0037
+1
We have the same issue. Does anyone have an idea of what's causing this?