
[CT-1162] Blueprint optional project ref statement parameter

dpguthrie opened this issue 1 year ago • 8 comments

Success Criteria:

  • Links to existing files in dbt-core that need to be updated
  • Net new files needed in dbt-core and why they're needed
  • Potential "gotcha" implementations based on the above (think: if we update this file here, we have to update 100 files downstream)
  • Pseudo-code snippets showing an initial approach to make this functionality work

dpguthrie commented Sep 13 '22

Related issues / PRs:

  • https://github.com/dbt-labs/dbt-core/issues/1269
  • https://github.com/dbt-labs/dbt-core/pull/3053

dpguthrie commented Sep 13 '22

Links to Existing Files (WIP)

  • Model uniqueness constraint across projects (core/dbt/parser/manifest.py#971)
  • Some possibilities related to the target_model_package, which we currently allow as an argument to ref:
    • Here's where we begin to process the ref and the package name comes into play (dbt/parser/manifest.py)
    • The manifest is then responsible for resolving this ref with an internal method here. This method accepts the target_model_package as an Optional argument.

dpguthrie commented Sep 13 '22

Progress Update

Question 1: What do we need to do to make ref accept a project parameter?

Answer: Nothing! Currently, ref accepts up to two arguments:

  1. target_model_package
  2. target_model_name

The logic exists within _process_refs_for_node:

if len(ref) == 1:
    target_model_name = ref[0]
elif len(ref) == 2:
    target_model_package, target_model_name = ref

However, the packages ref can resolve against are scoped to what we've defined in our packages.yml file, plus dbt's internal packages. The upstream project is not a package, though. We don't need the whole project locally, just a representation of it, something like a pared-down manifest.json file.
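To make that concrete, here's a minimal sketch of what such a pared-down artifact could contain; the contracts.json name and its shape here are hypothetical, not an existing dbt artifact. The idea is just enough metadata per public model for a downstream project to resolve a ref to a physical database location.

# Hypothetical sketch -- this artifact and its schema don't exist in dbt-core today.
import json

upstream_contracts = {
    "metadata": {"project_name": "upstream_project"},
    "nodes": {
        "model.upstream_project.dim_customers": {
            "name": "dim_customers",
            "package_name": "upstream_project",
            "database": "analytics",
            "schema": "marts",
            "alias": "dim_customers",
            "resource_type": "model",
        }
    },
}

# The upstream project would publish this alongside (or instead of) its manifest.json
with open("target/contracts.json", "w") as f:
    json.dump(upstream_contracts, f, indent=2)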

Question 2: What do we need to do then to resolve a reference to an upstream contract?

This is where it starts to get tricky.

The _process_refs_for_node function mentioned above will eventually call resolve_ref, a method on the Manifest class. That in turn uses an internal object called ref_lookup, which is a pointer to a RefableLookup object on the manifest instance. This object has a storage property that is populated from the nodes in the instantiated manifest:

def populate(self, manifest):
    for node in manifest.nodes.values():
        self.add_node(node)

My guess is we don't want to include a "node" from an upstream project within the nodes in our manifest, as we don't want these to actually be run in our project's scope. We simply need a way to resolve the appropriate references to the locations of those contract nodes.

So, here are some possible implementations:

Alter the way we populate the ref_lookup?

from itertools import chain

def populate(self, manifest):
    # Index upstream contract nodes alongside local nodes so refs can resolve to either
    for node in chain(manifest.nodes.values(), manifest.contracts_upstream.values()):
        self.add_node(node)

This assumes:

  • we create a contracts_upstream property in our manifest
    • The Manifest dataclass would have to be updated to account for a new field (dbt/contracts/graph/manifest.py)
    • [WIP] - Still investigating other places we'd have to update to accommodate a change like this
  • it would be constructed very similarly to our nodes, in that each resource would be of the ManifestNode type. Presumably, though, the structure for this would already have been created by the upstream project, and we should just be able to leverage that during parse time:

contracts_upstream: MutableMapping[str, ManifestNode] = field(default_factory=dict)
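As a minimal parse-time sketch, assuming the hypothetical contracts.json artifact above and a load hook that doesn't exist in dbt-core today:

# Hypothetical sketch: the hook name and where it runs during manifest loading
# are open design questions, and the deserialization call is illustrative.
import json
from dbt.contracts.graph.parsed import ParsedModelNode

def load_upstream_contracts(manifest, path: str) -> None:
    with open(path) as f:
        artifact = json.load(f)
    for unique_id, node_dict in artifact["nodes"].items():
        # Each upstream contract node is shaped like a ManifestNode, so it can be
        # resolved by the same lookup machinery as local nodes
        manifest.contracts_upstream[unique_id] = ParsedModelNode.from_dict(node_dict)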

Gotchas

Without making any changes to how this is populated, the name of each node would have to be unique across projects. Otherwise, you'd run into scenarios where an existing entry in the storage dictionary gets overwritten when the same name exists in more than one project.

This is how a node is added to the storage dictionary:

def add_node(self, node: ManifestNode):
    if node.resource_type in self._lookup_types:
        if node.name not in self.storage:
            self.storage[node.name] = {}
        self.storage[node.name][node.package_name] = node.unique_id

Create a new ContractLookup

When we resolve_ref, the node could be set like this:

node = (
    self.ref_lookup.find(target_model_name, pkg, self)
    or self.contract_lookup.find(target_model_name, pkg, self)
)

If the ref isn't found in the manifest's nodes, fall back to contracts_upstream.
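A minimal sketch of what that lookup could be, assuming it mirrors the existing RefableLookup but is populated from the hypothetical contracts_upstream mapping (ContractLookup and contract_lookup are proposed names, not existing dbt-core APIs):

# Hypothetical sketch: reuse RefableLookup's storage/find behavior, but only index
# nodes that came from upstream contracts. Import path reflects current dbt-core layout.
from dbt.contracts.graph.manifest import RefableLookup

class ContractLookup(RefableLookup):
    def populate(self, manifest):
        for node in manifest.contracts_upstream.values():
            self.add_node(node)

The Manifest would then expose a contract_lookup property, built lazily from this class the same way ref_lookup is today.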

Open Questions:

  • How does this hinder being able to use a contracts-like selector to build from?
  • What would we have to do to include these contracts in docs?
  • Would this even work?

dpguthrie commented Sep 19 '22

After your first answer to your open questions, let me know where I can help round out the answers

sungchun12 commented Sep 19 '22

Internal Slack feedback on the first attempt in the related PR:

Things that this makes me start thinking about:

  • Does this work if the upstream model has the same name as a model in the current project? What would need to change (this PR)?
  • How should the upstream project specify which models are public / ref'able?
  • What sort of error should the downstream project see when trying to ref a private model?
  • Should a "bad ref from other project" error be raised at parse time, or as the very first step during runtime? The former sounds better at first blush, but it would mean that you always need the full "export" artifact from the upstream project just to parse your own project (including dbt ls). I'm honestly not sure about this one!

dpguthrie commented Sep 20 '22

Does this work if the upstream model has the same name as a model in the current project? What would need to change (https://github.com/dbt-labs/dbt-core/pull/3053)?

My first inclination is that this will work if an upstream model has the same name as one in the current project, because of how uniqueness is determined:

def _check_resource_uniqueness(
    manifest: Manifest,
    config: RuntimeConfig,
) -> None:
    names_resources: Dict[str, ManifestNode] = {}
    alias_resources: Dict[str, ManifestNode] = {}

    for resource, node in manifest.nodes.items():

This function iterates through the nodes property on the manifest. This proposal creates an entirely new property on the manifest, consumers, which, at least in its current form, wouldn't be considered when dbt checks for uniqueness.

dpguthrie commented Sep 20 '22

How should the upstream project specify which models are public / ref'able?

I really like @christineberger's suggestion in the original discussion where the developer explicitly defines which models to make public. As an aside, I think the default here should be false, and it's up to the developer to opt in to anything being made public.

models:
  # Model folder level
  +shared: true
  # Sub-folder level
  staging:
    +shared: false

This configuration would share all models except for the ones located in the staging folder. It would then be used to inform what actually makes it into a JSON file that a downstream project could consume.
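As a rough sketch of the producer side, assuming the +shared config and contracts.json artifact proposed above (neither exists in dbt today), we could filter the project's own manifest down to shared models and write the pared-down file a consumer would read:

# Hypothetical sketch only -- the "shared" config and artifact shape are proposals.
import json
from dbt.node_types import NodeType

def write_contracts_artifact(manifest, path: str = "target/contracts.json") -> None:
    shared_nodes = {
        unique_id: {
            "name": node.name,
            "package_name": node.package_name,
            "database": node.database,
            "schema": node.schema,
            "alias": node.alias,
            "resource_type": str(node.resource_type),
        }
        for unique_id, node in manifest.nodes.items()
        # getattr because "shared" is not an existing NodeConfig field
        if node.resource_type == NodeType.Model and getattr(node.config, "shared", False)
    }
    with open(path, "w") as f:
        json.dump({"nodes": shared_nodes}, f, indent=2)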

dpguthrie commented Sep 20 '22

What sort of error should the downstream project see when trying to ref a private model?

I think it should be the same compilation error you would get today if ref'ing a model that doesn't exist in your own project. We shouldn't even state that they're trying to ref a private model.

I think you should treat it as web services treat private resources. For instance, I know the URLs of private repos at my last company, but because I no longer have permission to view them, I'll receive 404s if I try navigating directly to those pages. They exist! But that doesn't mean I should have that information.
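In other words, under the opt-in +shared approach above, a private model simply never appears in the exported artifact, so the downstream lookup fails exactly as it would for a model that doesn't exist at all. Extending the hypothetical resolve_ref snippet from the ContractLookup proposal above:

# Hypothetical sketch: private models aren't in contracts_upstream, so neither lookup
# finds them, and the caller raises the same "not found" compile error it raises today
# for a ref to a nonexistent model.
node = (
    self.ref_lookup.find(target_model_name, pkg, self)
    or self.contract_lookup.find(target_model_name, pkg, self)
)
if node is None:
    return None  # treated as an invalid ref downstream, exactly as it is today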

dpguthrie commented Sep 20 '22

Another implementation thought I had is to add some more properties to our NodeConfig (or something in a higher parent class). The idea is that these nodes we're pulling in from upstream projects are just that: nodes. Shouldn't they live alongside the nodes we've compiled from our project and other packages? What other properties could delineate them from the rest, though?

  • execute - This should default to True. The thinking here is we need a way to identify nodes from contracts so we don't run them within our project's scope (unless given permission to do so by the upstream project).
  • share - This would be used by whatever functions we develop to create the JSON artifact an upstream producer publishes.
  • is_contract - Pretty self-explanatory.
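A minimal sketch of those additions on the config side; the three fields are the proposals above, and the class is a standalone stand-in since the real NodeConfig lives in core/dbt/contracts/graph/model_config.py:

# Hypothetical sketch only -- these fields don't exist on dbt's NodeConfig today.
from dataclasses import dataclass

@dataclass
class ContractNodeConfigAdditions:
    execute: bool = True       # False for nodes pulled in from upstream contracts,
                               # so they never run in this project's scope
    share: bool = False        # opt-in flag the producer uses when writing the artifact
    is_contract: bool = False  # marks a node as originating from an upstream contract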

I think the downstream impacts of this approach are much greater than the initial implementation proposed using a ConsumerLookup, but it may make more sense.

dpguthrie commented Sep 28 '22