dvc icon indicating copy to clipboard operation
dvc copied to clipboard

odb: fix ambiguous remote resolution

Open pared opened this issue 3 years ago • 8 comments

Related : https://github.com/iterative/studio-support/issues/57

  1. Ambiguous behavior: For given script:
#!/bin/bash

set -exu
pushd $TMPDIR

wsp=test_wspace
rep=test_repo

rm -rf $wsp && mkdir $wsp && pushd $wsp
main=$(pwd)

mkdir $rep && pushd $rep

git init 
dvc init

echo "data" > data
echo "data2" > data2

dvc add data data2

echo "  remote: myremote" >> data2.dvc
dvc remote add myremote $main/storage

git add -A
git commit -am "initial"

dvc push -r myremote
rm -rf data data2 .dvc/cache

dvc pull

in fetch

we will get three "different" odb. (None, Local#1, and Local#2). This is happening due to two things: one is that we don't evaluate "no" remote to our default remote, which, for consistency, we probably should. Another, the bigger problem is that two odb's are never equal: https://github.com/iterative/dvc-objects/issues/134 While it might not be a big issue for us (yet), Studio might be impacted considerably by it, as we are creating separate remotes for each output. Even though we should use one.

It seems that it's also related to https://github.com/iterative/dvc-objects/issues/4 - with odb caching we could probably also avoid spawning new odb's.

cc @Suor

pared avatar Aug 25 '22 08:08 pared

we will get three "different" odb. (None, Local#1, and Local#2). This is happening due to two things: one is that we don't evaluate "no" remote to our default remote, which, for consistency, we probably should.

None means "this object can come from any ODB". Having an explicit ODB set during object collection means "this object has to come from this ODB". (None does not mean "this object has to come from the default DVC ODB").

During current push/fetch, we check the default DVC remote (or a remote explicitly passed via -r/--remote) for objects with None/"any ODB". This is implementation specific to the current push/fetch behavior though. We have discussed the potential for future behavior where you may be able to configure multiple possible remotes (i.e. mirrors) to fetch from. In this case, None/"any ODB" means we would check any of the available mirrors.

So IMO returning a None ODB (and not using "default DVC remote ODB") during object collection is the correct behavior.

While it might not be a big issue for us (yet), Studio might be impacted considerably by it, as we are creating separate remotes for each output. Even though we should use one.

The (lack of) ODB filesystem caching is a known issue, but is this actually affecting Studio? From the linked issue it doesn't look like this is causing problems (the problem was that studio doesn't support per-output remotes).

pmrowla avatar Aug 25 '22 13:08 pmrowla

with odb caching we could probably also avoid spawning new odb's.

Instantiating ODB does not involve anything other than simple __init__, and the underlying filesystems are already cached. I did not expect that to be slow.

skshetry avatar Aug 25 '22 13:08 skshetry

I think the issue with ODB caching was mainly related to having to re-instantiate the FS's when building reference objects. But that may have already been resolved with the current dvc-objects/dvc-data changes.

pmrowla avatar Aug 25 '22 13:08 pmrowla

Instantiating ODB does not involve anything other than simple init, and the underlying filesystems are already cached. I did not expect that to be slow.

Is there a use case where this behavior is expected? If the underlying fs are cached, why not cache this too?

pared avatar Aug 25 '22 14:08 pared

Instantiating ODB does not involve anything other than simple init, and the underlying filesystems are already cached. I did not expect that to be slow.

Is there a use case where this behavior is expected? If the underlying fs are cached, why not cache this too?

Why cache if you don’t need to? In fsspec filesystems, there may be state/sessions/client that need to be reused/reopened so caching helps.

For ODB, it’s not necessary, and caching comes with it’s own set of problems.

skshetry avatar Aug 25 '22 14:08 skshetry

fs comparison is a broader topic, we would be ideally using https://github.com/fsspec/filesystem_spec/issues/773 , but that requires a bit of work.

efiop avatar Aug 25 '22 21:08 efiop

The suspicious even technically part for me is that ObjectDB.__eq__ does not correspond to ObjectDB.__hash__. The latter is quite strait-forward (self.fs.protocol, self.path).

These two are supposed to have a matching logic, as I understand.

Suor avatar Aug 26 '22 10:08 Suor

So IMO returning a None ODB (and not using "default DVC remote ODB") during object collection is the correct behavior.

It might be understood like that. But after collection None need to be merged with a odb instance for a default/chosen remote. So that their sets would be joined. Otherwise, all that stuff will be downloaded serially not in parallel - see the second part of Repo.fetch() implementation().

Having a distinct odb for each out with remote set adds to the issue above. One might have hundreads and thousands of serial downloads then.

The (lack of) ODB filesystem caching is a known issue, but is this actually affecting Studio?

Once we start to support out.remote it may become the same issue as described above. Unless we dedup odbs manually, which intend to do for now.

Suor avatar Aug 26 '22 10:08 Suor