dvc icon indicating copy to clipboard operation
dvc copied to clipboard

dvc.api.open (python): fails with an absolute path

Open hugo-ricateau opened this issue 3 years ago • 6 comments

Description

When trying to dvc.api.open a file present on a remote but not locally (neither in cache), call fails with the following exception if the provided path is absolute (it works fine if the provided path is relative to the dvc repository):

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:125, in _DataFileSystem.info(self, path, **kwargs)
    124 try:
--> 125     outs = list(self.repo.index.tree.iteritems(key))  # noqa: B301
    126 except KeyError as exc:

File /usr/local/lib/python3.9/site-packages/dvc_data/objects/tree.py:106, in Tree.iteritems(self, prefix)
    104         self._load(key, meta, hash_info)
--> 106 for key, (meta, hash_info) in self._trie.iteritems(**kwargs):
    107     self._load(key, meta, hash_info)

File /usr/local/lib/python3.9/site-packages/pygtrie.py:718, in Trie.iteritems(self, prefix, shallow)
    678 """Yields all nodes with associated values with given prefix.
    679
    680 Only nodes with values are output.  For example::
   (...)
    716     KeyError: If ``prefix`` does not match any node.
    717 """
--> 718 node, _ = self._get_node(prefix)
    719 for path, value in node.iterate(list(self.__path_from_key(prefix)),
    720                                 shallow, self._iteritems):

File /usr/local/lib/python3.9/site-packages/pygtrie.py:630, in Trie._get_node(self, key)
    629 if node is None:
--> 630     raise KeyError(key)
    631 trace.append((step, node))

KeyError: ('local', '/', 'repo', 'path', 'to', 'target.ext')

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py:505, in Repo.open_by_relpath(self, path, remote, mode, encoding)
    504 try:
--> 505     with fs.open(
    506         fs_path,
    507         mode=mode,
    508         encoding=encoding,
    509         remote=remote,
    510     ) as fobj:
    511         yield fobj

File /usr/local/lib/python3.9/site-packages/dvc_objects/fs/base.py:191, in FileSystem.open(self, path, mode, **kwargs)
    190     kwargs.pop("encoding", None)
--> 191 return self.fs.open(path, mode=mode, **kwargs)

File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:88, in _DataFileSystem.open(self, path, mode, encoding, **kwargs)
     85 def open(  # type: ignore
     86     self, path: str, mode="r", encoding=None, **kwargs
     87 ):  # pylint: disable=arguments-renamed, arguments-differ
---> 88     fs, fspath = self._get_fs_path(path, **kwargs)
     89     return fs.open(fspath, mode=mode, encoding=encoding)

File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:65, in _DataFileSystem._get_fs_path(self, path, remote)
     63 from dvc.config import NoRemoteError
---> 65 info = self.info(path)
     66 if info["type"] == "directory":

File /usr/local/lib/python3.9/site-packages/dvc/fs/data.py:127, in _DataFileSystem.info(self, path, **kwargs)
    126 except KeyError as exc:
--> 127     raise FileNotFoundError from exc
    129 ret = {
    130     "type": "file",
    131     "size": 0,
   (...)
    135     "name": path,
    136 }

FileNotFoundError:

The above exception was the direct cause of the following exception:

FileMissingError                          Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 with open('/repo/path/to/target.ext') as file:
      2     print(file.read(10))

File /usr/local/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
    117 del self.args, self.kwds, self.func
    118 try:
--> 119     return next(self.gen)
    120 except StopIteration:
    121     raise RuntimeError("generator didn't yield") from None

File /usr/local/lib/python3.9/site-packages/dvc/api/data.py:198, in _open(path, repo, rev, remote, mode, encoding)
    196 def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
    197     with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
--> 198         with _repo.open_by_relpath(
    199             path, remote=remote, mode=mode, encoding=encoding
    200         ) as fd:
    201             yield fd

File /usr/local/lib/python3.9/contextlib.py:119, in _GeneratorContextManager.__enter__(self)
    117 del self.args, self.kwds, self.func
    118 try:
--> 119     return next(self.gen)
    120 except StopIteration:
    121     raise RuntimeError("generator didn't yield") from None

File /usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py:513, in Repo.open_by_relpath(self, path, remote, mode, encoding)
    511         yield fobj
    512 except FileNotFoundError as exc:
--> 513     raise FileMissingError(path) from exc
    514 except IsADirectoryError as exc:
    515     raise DvcIsADirectoryError(f"'{path}' is a directory") from exc

FileMissingError: Can't find '/repo/path/to/target.ext' neither locally nor on remote

Reproduce

Create a repository (at e.g. /repo) with an s3 remote; add a file (/repo/path/to/target.ext) with some lipsum content; add it to dvc; push to remote; delete locally (including cache); then

from dvc.api import open

with open('/repo/path/to/target.ext') as file:
    pass

should raise, while

from dvc.api import open

with open('path/to/target.ext') as file:
    pass

should work as expected (assuming pwd being at /repo).

Expected

Relative path being interpreted as relative to the current working directory, and absolute path as the path they describe.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.13.0 (pip)
---------------------------------
Platform: Python 3.9.13 on Linux-5.10.104-linuxkit-x86_64-with-glibc2.31
Supports:
	webhdfs (fsspec = 2022.5.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.5.2),
	s3 (s3fs = 2022.5.0, boto3 = 1.21.21),
	ssh (sshfs = 2022.6.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3
Workspace directory: overlay on overlay
Repo: dvc (no_scm)

hugo-ricateau avatar Jul 21 '22 16:07 hugo-ricateau

@hugo-ricateau Absolute paths in open() are considered to be external local outputs (legacy thing). If you have an absolute path already to a file, you should be able to just open it yourself, as dvc.api.open is really meant to work with dvc repositories with paths that are detached from where your repo is located.

Closing as expected behaviour.

efiop avatar Jul 21 '22 20:07 efiop

Absolute paths in open() are considered to be external local outputs (legacy thing).

If it is a "legacy thing", why keeping this behaviour?

If you have an absolute path already to a file, you should be able to just open it yourself [...]

I think the most important feature of dvc is the ability to offload large files to a remote and purge them from local environments, while still being able to access them; if they are not on the file-system, having an absolute path is of no help.

[...] as dvc.api.open is really meant to work with dvc repositories with paths that are detached from where your repo is located.

Properly computing such relative path might be harder than computing the absolute one: __file__ easily gives a local reference point, however, finding the repository path requires more code to be resiliently determined.

@efiop, I really think you should consider reworking this "expected behaviour" as it is extremely confusing (and by the way, not documented at all in the docstring of the open method) in addition to be an easy feature left beside: dvc package is in the best position to determine repository path and identify that the provided absolute one is_relative_to it.

hugo-ricateau avatar Jul 21 '22 21:07 hugo-ricateau

@hugo-ricateau I really wish we could drop that legacy thing 😄 But it will probably have to wait for 3.0.

Properly computing such relative path might be harder than computing the absolute one: file easily gives a local reference point, however, finding the repository path requires more code to be resiliently determined.

Would something like this work for you? (Repo will look for .dvc in path parents)

from dvc.repo import Repo

Repo(path).root_dir

Could you elaborate on your scenario, please? Maybe I don't quite understand it.

Note that api.open(path) is the same as api.open(path, repo="."), so it won't even work unless you are in a repo already. So we could consider supporting what you are suggesting, but it will likely be a breaking change (unless I'm missing something).

efiop avatar Jul 21 '22 21:07 efiop

@efiop, Repo.root_dir partially answers the issue as it definitely eases transforming the absolue path into the correct relatif one before providing it to api.open.

Could you elaborate on your scenario, please? Maybe I don't quite understand it.

Sure, the use case is to save the model right along the source code employed to generate it: e.g. foo/bar.py generates foo/bar.model. Thus, employing __file__ to resiliently determine the source path (which is relative to pwd by default but can thus easily be transformed in an absolute one). Then this model this added to dvc, pushed to the remote and (generally) removed from the local file-system. When loading the model, the same path manipulation is employed (targeting an absolute path as pwd might be different), and passed to api.open in order to stream the model from the remote. The correct call to do that should be something like api.open(path.relative_to(Repo(path).root_dir), repo=Repo(path).root_dir).

What I'm expecting when providing an absolute path to api.open is that it opens the target file provided it exists / belongs to the current repo and is under tracking: thus having something like

path = path.relative_to(repo.root_dir) if path.is_absolute() else path

as a first step in api.open; which indeed raises if the provided path is outside the repo (the provided or default one).

hugo-ricateau avatar Jul 21 '22 22:07 hugo-ricateau

Note that repo.Repo fails if the provided path does not exists: typical use-case is a path within a dvc-tracked directory.

hugo-ricateau avatar Jul 21 '22 22:07 hugo-ricateau

@efiop Let's keep it open as a feature request?

It makes sense that to avoid having to separately pass a repo URL and a separate relative path URL. @efiop and I have discussed that we both would prefer to get rid of this distinction.

I can't promise that we will get to it any time soon, though. I think the current behavior is expected and explained in the docstring:

https://github.com/iterative/dvc/blob/44a4fb59a7ef812ff84671fc68d1ae7843b9ac03/dvc/api/data.py#L73-L77

The same is explained in https://dvc.org/doc/api-reference/open#parameters:

path (required) - location and file name of the target to open, relative to the root of the project (repo).

repo - specifies the location of the DVC project. It can be a URL or a file system path. Both HTTP and SSH protocols are > supported for online Git repos (e.g. [user@]server:project.git). Default: The current project is used (the current working > directory tree is walked up to find it).

dberenbaum avatar Jul 27 '22 18:07 dberenbaum

I think the current behavior is expected and explained in the docstring:

https://github.com/iterative/dvc/blob/44a4fb59a7ef812ff84671fc68d1ae7843b9ac03/dvc/api/data.py#L73-L77

The same is explained in https://dvc.org/doc/api-reference/open#parameters:

@dberenbaum, I can't figure out why at the moment, but the docstring displayed within ipython (using open? after a from dvc.api import open) is quite different and much less complete:

Signature: open(path, repo=None, rev=None, remote=None, mode='r', encoding=None)
Docstring:
Open file in the supplied path tracked in a repo (both DVC projects and
plain Git repos are supported). For Git repos, HEAD is used unless a rev
argument is supplied. The default remote is tried unless a remote argument
is supplied. It may only be used as a context manager:

    with dvc.api.open(
            'path/to/file',
            repo='https://example.com/url/to/repo'
            ) as fd:
        # ... Handle file object fd
File:      ~/.pyenv/versions/3.6.15/envs/ds/lib/python3.6/site-packages/dvc/api.py
Type:      function

hugo-ricateau avatar Aug 17 '22 08:08 hugo-ricateau

@dberenbaum, I can't figure out why at the moment, but the docstring displayed within ipython (using open? after a from dvc.api import open) is quite different and much less complete:

@hugo-ricateau I updated the docstring in https://github.com/iterative/dvc/commit/15e10a1d81a45018d61d5287c20c73c3aa49c266

You might be looking at a version previous to this update.

daavoo avatar Aug 17 '22 09:08 daavoo