qri restructure dependencies as dscache comes into being

We currently have the this dependency chain in our codebase, and it has some issues:

+----------+
| cmd      |
+----------+
    |
    | **** This is a problem, these packages should be siblings
    v
+----------+
| api      |
+----------+
    |
    |
    v
+----------+
| lib      |
+----------+
    |
    |
    v
+----------+
| fsi      |
+----------+
    |
    |
    v
+----------+
| base     |
+----------+
    |
    | **** Also a problem, repo should be higher up
    v
+----------+
| repo     |
+----------+
    |
    |
    v
+----------+
| dscache  |
+----------+

As we move away from repo, and towards dscache, this should be changed to this:

+----------+     +----------+
| cmd      |     | api      |
+----------+     +----------+
    |                |
    |     /----------/
    v    v
+----------+
| lib      |------\
+----------+       |
    |              v
    |            +----------+
    |            | dscache  |<...
    |            +----------+    .
    |                |           .
    v                v           . **** Indirect, via callback
+----------+     +----------+    .
| base     |     | fsi      |    .
+----------+     +----------+   .
    |     \                    .
    |      ---------\         .
    v               v        .
+----------+     +----------+
| dsfs     |     | logbook  |
+----------+     +----------+

The key takeaways from this better dependency graph:

dscache is a high-level package. It depends on logbook in a logical sense, and starts doing name-resolution in lib, as well as storing info about multiple subsystems
We should move fsi under dscache so that it is no longer exposed to lib. This will help simplify our story about handling datasets in repo vs fsi in the same fashion.
base currently uses repo because it has to write a ref to the repo when saving. We fix this in the new version by writing to logbook (which is low-level), and logbook updates dscache indirectly by invoking a callback
We currently have cmd depending on api, because qri connect has to start the api. We can do another callback trick here, by having the cmd constructor pass in a callback that can be used to startup the api server.

The main motivation for this change is dscache development. A lot of changes lead to circular dependencies that necessitate creating subpackages in dscache. For example: repo knows about dscache, but dscache can also be built from a repo (by combining logbook, refs, and fsi). So we had to move this builder into dscache/build/from_repo.go. Also, when developing qri get using dscache, we want to be able to load datasets from dscache or from fsi within dscache's package, but this leads to a circular dependency (see first chart), so the code has to live instead in dscache/loader/loader.go

Mar 02 '20 18:03 dustmop

We should move fsi under dscache so that it is no longer exposed to lib. This will help simplify our story about handling datasets in repo vs fsi in the same fashion.

Disagree on this. I don't think file system integration is purely the domain of a cache. In #1133, we started to talk about critical path packages organized into an import stack and event driven packages that must be siblings with few/no critical path imports. Full notes here. A lib.Instance should have an internal fsi field.

I totally see the impetus to cut down the complexity of all these subsystems interacting, just not sure that we should solve that by stacking imports.

We might be arriving at a place where subsystems need to use events to communicate between each other. The need for callbacks seems to indicate yes, but then we're stuck back at this "how do you know you're finished" problem.

Mar 02 '20 20:03 b5

Ok, I thought some more about this, and I have a recommendation. First, agreed that FSI is not in the domain of a cache, totally correct. However, I think the current relationship with lib and fsi is a problem. There's a lot of code duplication like this:

		ref, err := base.ToDatasetRef(p.Ref, r.repo, p.UseFSI)
		if err != nil {
			return err
		}
		if p.UseFSI {
			if ref.FSIPath == "" {
				return fsi.ErrNoLink
			}
			if ds, err = fsi.ReadDir(ref.FSIPath); err != nil {
				return fmt.Errorf("loading linked dataset: %s", err)
			}
		} else {
			ds, err = dsfs.LoadDataset(ctx, r.repo.Store(), ref.Path)
			if err != nil {
				return fmt.Errorf("loading dataset: %s", err)
			}
		}

this can be seen in Get in lib/datasets.go and RenderReadme in lib/render.go and Validate in lib/datasets.go. This pattern should be extracted and put into a single location that is responsible for all ref resolution.

Relatedly, while looking at the codepath in base/save.go specifically PrepareDataset I noticed it is calling CanonicalizeDataset in order to get the previousPath. This is bad for two reasons:

calling CanonicalizeDataset this low in the stack is a problem, resolution should live higher
we should switch to initIDs at the base level, instead of passing dsrefs (string or parsed)

Ideally, as we move to dscache, the base package wouldn't have a reference to the dscache at all, in order to match the dependency tree I made above.

Here's the proposal:

a new package called resolve/ which is very low on the dependency tree
this package contains only an interface that accepts an initID and returns a dataset, or maybe just a previous path. Keep it minimal, is my point. Maybe for now it also has a method to convert a dsref into a dataset, but we switch away from that over time.
base takes a Resolver interface and an initID
dscache can return an instance of this Resolver which is used by base to get the previous dataset version
another subpackage like resolve/names/ that is a very-high level package, and should only be in lib/. It wraps dscache and FSI, and handles all resolving of user-specific dataset references by turning them into concrete Datasets in whatever manner is needed. Using the NameResolver cleans up all of the messy duplication that's currently in lib/.

We spoke offline a few days ago, and I don't think we should switch to asynchronous events for things like writing to logbook and updating dscache. All of those activities are intrinsically synchronous and should stay that way. We should use callbacks or, even better, interfaces wherever we can to handle these dependency problems.

Mar 12 '20 21:03 dustmop

qri qri copied to clipboard

restructure dependencies as dscache comes into being

qri
qri copied to clipboard