odb: fix ambiguous remote resolution
Related : https://github.com/iterative/studio-support/issues/57
- Ambiguous behavior: For given script:
#!/bin/bash
set -exu
pushd $TMPDIR
wsp=test_wspace
rep=test_repo
rm -rf $wsp && mkdir $wsp && pushd $wsp
main=$(pwd)
mkdir $rep && pushd $rep
git init
dvc init
echo "data" > data
echo "data2" > data2
dvc add data data2
echo " remote: myremote" >> data2.dvc
dvc remote add myremote $main/storage
git add -A
git commit -am "initial"
dvc push -r myremote
rm -rf data data2 .dvc/cache
dvc pull
in fetch
we will get three "different" odb. (None, Local#1, and Local#2). This is happening due to two things: one is that we don't evaluate "no" remote to our default remote, which, for consistency, we probably should. Another, the bigger problem is that two odb's are never equal: https://github.com/iterative/dvc-objects/issues/134 While it might not be a big issue for us (yet), Studio might be impacted considerably by it, as we are creating separate remotes for each output. Even though we should use one.
It seems that it's also related to https://github.com/iterative/dvc-objects/issues/4 - with odb caching we could probably also avoid spawning new odb's.
cc @Suor
we will get three "different" odb. (None, Local#1, and Local#2). This is happening due to two things: one is that we don't evaluate "no" remote to our default remote, which, for consistency, we probably should.
None means "this object can come from any ODB". Having an explicit ODB set during object collection means "this object has to come from this ODB". (None does not mean "this object has to come from the default DVC ODB").
During current push/fetch, we check the default DVC remote (or a remote explicitly passed via -r/--remote) for objects with None/"any ODB". This is implementation specific to the current push/fetch behavior though. We have discussed the potential for future behavior where you may be able to configure multiple possible remotes (i.e. mirrors) to fetch from. In this case, None/"any ODB" means we would check any of the available mirrors.
So IMO returning a None ODB (and not using "default DVC remote ODB") during object collection is the correct behavior.
While it might not be a big issue for us (yet), Studio might be impacted considerably by it, as we are creating separate remotes for each output. Even though we should use one.
The (lack of) ODB filesystem caching is a known issue, but is this actually affecting Studio? From the linked issue it doesn't look like this is causing problems (the problem was that studio doesn't support per-output remotes).
with odb caching we could probably also avoid spawning new odb's.
Instantiating ODB does not involve anything other than simple __init__, and the underlying filesystems are already cached. I did not expect that to be slow.
I think the issue with ODB caching was mainly related to having to re-instantiate the FS's when building reference objects. But that may have already been resolved with the current dvc-objects/dvc-data changes.
Instantiating ODB does not involve anything other than simple init, and the underlying filesystems are already cached. I did not expect that to be slow.
Is there a use case where this behavior is expected? If the underlying fs are cached, why not cache this too?
Instantiating ODB does not involve anything other than simple init, and the underlying filesystems are already cached. I did not expect that to be slow.
Is there a use case where this behavior is expected? If the underlying fs are cached, why not cache this too?
Why cache if you don’t need to? In fsspec filesystems, there may be state/sessions/client that need to be reused/reopened so caching helps.
For ODB, it’s not necessary, and caching comes with it’s own set of problems.
fs comparison is a broader topic, we would be ideally using https://github.com/fsspec/filesystem_spec/issues/773 , but that requires a bit of work.
The suspicious even technically part for me is that ObjectDB.__eq__ does not correspond to ObjectDB.__hash__. The latter is quite strait-forward (self.fs.protocol, self.path).
These two are supposed to have a matching logic, as I understand.
So IMO returning a None ODB (and not using "default DVC remote ODB") during object collection is the correct behavior.
It might be understood like that. But after collection None need to be merged with a odb instance for a default/chosen remote. So that their sets would be joined. Otherwise, all that stuff will be downloaded serially not in parallel - see the second part of Repo.fetch() implementation().
Having a distinct odb for each out with remote set adds to the issue above. One might have hundreads and thousands of serial downloads then.
The (lack of) ODB filesystem caching is a known issue, but is this actually affecting Studio?
Once we start to support out.remote it may become the same issue as described above. Unless we dedup odbs manually, which intend to do for now.