dbt-core
[CT-1162] Blueprint optional project ref statement parameter
Success Criteria:
- Links to existing files in dbt-core that need to be updated
- Net new files in dbt-core and why they're needed
- Potential "gotcha" implementations based on the above (think: if we update this file here, we have to update 100 files downstream)
- Pseudo-code snippets on an initial approach for the above to make this functionality work
Related issues / PRs:
- https://github.com/dbt-labs/dbt-core/issues/1269
- https://github.com/dbt-labs/dbt-core/pull/3053
Links to Existing Files (WIP)
- Model uniqueness constraint across projects (core/dbt/parser/manifest.py#971)
- Some possibilities related to the `target_model_package` argument, which we currently allow in `ref`:
  - Here's where we begin to process the ref and where the package name comes into play (dbt/parser/manifest.py)
  - The manifest is then responsible for resolving this ref with an internal method here. This method accepts `target_model_package` as an Optional argument.
Progress Update
Question 1: What do we need to do to make `ref` accept a project parameter?
Answer: Nothing! Currently, `ref` accepts up to two arguments:
- `target_model_package`
- `target_model_name`
The logic exists within `_process_refs_for_node`:

```python
if len(ref) == 1:
    target_model_name = ref[0]
elif len(ref) == 2:
    target_model_package, target_model_name = ref
```
However, packages are scoped to what we've defined in our `packages.yml` file, plus dbt's internal packages. The upstream project is not a package, though. We don't need the whole project locally, just a representation of it, something like a pared-down `manifest.json` file.
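The exact shape of that representation is open; a sketch of what one pared-down entry might carry is below — just enough metadata to resolve a `ref()` to a physical relation, nothing runnable. The field names here are illustrative, not dbt's actual manifest schema.

```python
# Hypothetical pared-down representation of one upstream model: only the
# fields a downstream project would need to compile a ref() into a
# fully qualified relation.
upstream_contracts = {
    "model.upstream_project.dim_customers": {
        "name": "dim_customers",
        "package_name": "upstream_project",
        "database": "analytics",
        "schema": "marts",
        "alias": "dim_customers",
    },
}

def relation_for(contract: dict) -> str:
    """Render the fully qualified relation a ref() would compile to."""
    return "{database}.{schema}.{alias}".format(**contract)
```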
Question 2: What do we need to do then to resolve a reference to an upstream contract?
This is where it starts to get tricky.
The `_process_refs_for_node` function mentioned above will eventually call `resolve_ref`, which is a method on the `Manifest` class. That in turn uses an internal object called `ref_lookup`, which is a pointer to a `RefableLookup` object on the `manifest` instance. This object has a `storage` property that is `populate`d from the nodes in the instantiated manifest:
```python
def populate(self, manifest):
    for node in manifest.nodes.values():
        self.add_node(node)
```
My guess is we don't want to include a "node" from an upstream project within the `nodes` in our `manifest`, as we don't want these to actually be run in our project's scope. We simply need a way to resolve references to the locations of those contract nodes.
So, what could be some possible implementations?

Alter the way we `populate` the `ref_lookup`?
```python
def populate(self, manifest):
    for node in chain(manifest.nodes.values(), manifest.contracts_upstream.values()):
        self.add_node(node)
```
This assumes:
- We create a `contracts_upstream` property in our manifest
- The `Manifest` dataclass would have to be updated to account for a new field (dbt/contracts/graph/manifest.py) - [WIP] - still investigating other places we'd have to update to accommodate a change like this
- It would be constructed very similarly to our `nodes`, in that each resource would be of the `ManifestNode` type. Presumably, though, the structure for this would already have been created by the upstream project and we should just be able to leverage that during parse time.

```python
contracts_upstream: MutableMapping[str, ManifestNode] = field(default_factory=dict)
```
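How the field gets filled in is open; one sketch is to hydrate it at parse time from the upstream project's exported artifact. The artifact layout and the `shared` config key below are assumptions for illustration, not existing dbt behavior:

```python
import json
from typing import Any, Dict

def load_upstream_contracts(path: str) -> Dict[str, Dict[str, Any]]:
    """Read a hypothetical upstream contract artifact and return a
    mapping of unique_id -> node metadata, ready to merge into the
    proposed contracts_upstream field. Only nodes the producer marked
    as shared are kept."""
    with open(path) as fh:
        artifact = json.load(fh)
    return {
        unique_id: node
        for unique_id, node in artifact.get("nodes", {}).items()
        if node.get("config", {}).get("shared", False)
    }
```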
Gotchas

Without making any changes to how this is populated, the names of each node would have to be unique across each project. Otherwise, you'd run into scenarios where you would overwrite an existing node within the `storage` dictionary if it exists more than one time across each project.

This is how a node is added to the `storage` dictionary:
```python
def add_node(self, node: ManifestNode):
    if node.resource_type in self._lookup_types:
        if node.name not in self.storage:
            self.storage[node.name] = {}
        self.storage[node.name][node.package_name] = node.unique_id
```
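A toy version of that storage layout makes the behavior concrete: entries are keyed by name and then by package, so names can coexist across packages, but a repeated name/package pair is silently overwritten. These are simplified stand-ins, not the real `RefableLookup`:

```python
# Toy reproduction of the layout above: name -> {package -> unique_id}.
storage: dict = {}

def add_node(name: str, package_name: str, unique_id: str) -> None:
    if name not in storage:
        storage[name] = {}
    storage[name][package_name] = unique_id

add_node("dim_customers", "my_project", "model.my_project.dim_customers")
add_node("dim_customers", "upstream", "model.upstream.dim_customers")
# The same name under different packages coexists, but repeating a
# name/package pair silently replaces the earlier unique_id:
add_node("dim_customers", "upstream", "model.upstream.dim_customers_v2")
```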
Create a new `ContractLookup`

When we `resolve_ref` the node, it could be set to this:
```python
node = (
    self.ref_lookup.find(target_model_name, pkg, self)
    or self.contract_lookup.find(target_model_name, pkg, self)
)
```
If it's not found in the manifest's `nodes`, then look in `contracts_upstream`.
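A minimal sketch of what that `ContractLookup` could look like, assuming the hypothetical `contracts_upstream` mapping exists and its values expose `name`, `package_name`, and `unique_id` (the `find` signature is simplified here relative to the real lookup's):

```python
from typing import Dict, Optional

class ContractLookup:
    """Sketch of a lookup that mirrors RefableLookup's storage layout
    but populates from the proposed contracts_upstream mapping instead
    of manifest.nodes."""

    def __init__(self, manifest) -> None:
        self.storage: Dict[str, Dict[str, str]] = {}
        self.populate(manifest)

    def populate(self, manifest) -> None:
        for node in manifest.contracts_upstream.values():
            self.add_node(node)

    def add_node(self, node) -> None:
        self.storage.setdefault(node.name, {})[node.package_name] = node.unique_id

    def find(self, name: str, package: Optional[str]) -> Optional[str]:
        candidates = self.storage.get(name, {})
        if package is not None:
            return candidates.get(package)
        # With no package given, only an unambiguous match resolves.
        return next(iter(candidates.values())) if len(candidates) == 1 else None
```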
Open Questions:
- How does this hinder being able to use a `contracts`-like selector to build from?
- What would we have to do to include these contracts in docs?
- Would this even work?
After your first pass at the open questions, let me know where I can help round out the answers.
Internal slack feedback for the first attempt in the related PR
Things that this makes me start thinking about:
- Does this work if the upstream model has the same name as a model in the current project? What would need to change (this PR)?
- How should the upstream project specify which models are public / ref'able?
- What sort of error should the downstream project see when trying to ref a private model?
- Should a "bad ref from other project" error be raised at parse time, or as the very first step during runtime? The former sounds better at first blush, but it would mean that you always need the full "export" artifact from the upstream project just to parse your own project (including dbt ls). I'm honestly not sure about this one!
Does this work if the upstream model has the same name as a model in the current project? What would need to change (https://github.com/dbt-labs/dbt-core/pull/3053)?
My first inclination is that this will work if an upstream model has the same name as one in the current project. The reason is how uniqueness is determined:
```python
def _check_resource_uniqueness(
    manifest: Manifest,
    config: RuntimeConfig,
) -> None:
    names_resources: Dict[str, ManifestNode] = {}
    alias_resources: Dict[str, ManifestNode] = {}
    for resource, node in manifest.nodes.items():
        ...
```
This function iterates through the `nodes` property on the `manifest`. This proposal creates an entirely new property on the manifest, `consumers`, which wouldn't be considered, at least in its current form, when dbt checks for uniqueness.
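A toy version makes the point concrete: the check only sees whatever mapping it iterates, so a same-named entry kept in a separate `consumers`-style mapping never trips it. This is a simplified stand-in, not the real `_check_resource_uniqueness`:

```python
def check_resource_uniqueness(nodes: dict) -> None:
    """Toy stand-in: raise on duplicate model names, but only within
    the single mapping it is handed."""
    seen = set()
    for node in nodes.values():
        if node["name"] in seen:
            raise ValueError(f"duplicate model name: {node['name']}")
        seen.add(node["name"])

nodes = {"model.my_project.dim_customers": {"name": "dim_customers"}}
consumers = {"model.upstream.dim_customers": {"name": "dim_customers"}}

# Passes: the upstream duplicate lives in a mapping the check never sees.
check_resource_uniqueness(nodes)
```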
How should the upstream project specify which models are public / ref'able?
I really like @christineberger's suggestion in the original discussion where the developer explicitly defines which models to make public. As an aside, I think the default here is `false` and it's up to the developer to opt in to anything being made public.
```yaml
models:
  # Model folder level
  +shared: true
  # Sub-folder level
  staging:
    +shared: false
```
This configuration would share all models except for those located in the `staging` folder. It would then be used to inform what actually makes it into a JSON file that a downstream project could consume.
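Resolving that folder-level cascade could look something like this toy resolver — the real dbt config hierarchy is more involved, and `shared` is the proposed key, not an existing config:

```python
# Toy resolution of a folder-level "shared" config: the most specific
# folder prefix wins, loosely mirroring how dbt configs cascade.
config = {
    "models": {"shared": True, "staging": {"shared": False}},
}

def is_shared(model_path: str) -> bool:
    node = config["models"]
    shared = node.get("shared", False)  # default is opt-out, per the proposal
    for folder in model_path.split("/")[:-1]:  # walk folders, skip file name
        node = node.get(folder, {})
        if "shared" in node:
            shared = node["shared"]
    return shared
```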
What sort of error should the downstream project see when trying to ref a private model?
I think the same compilation error that you would get today when `ref`ing a model that doesn't exist in your own project. We shouldn't even state that they're trying to `ref` a private model.
I think you should treat it as web services treat private resources. For instance, I know the urls of private repos at my last company, but because I no longer have permission to view those, I'll receive 404s if I try directly navigating to those pages. They exist! But that doesn't mean I should have that information.
Another implementation thought I had is to add some more properties to our `NodeConfig` (or something in a higher parent class). The idea being that the nodes we're pulling in from upstream projects are just that: nodes. Shouldn't they live alongside the nodes that we've compiled from our project and other packages? What other properties could delineate these from the rest, though?
- `execute` - This should default to `True`. The thinking behind this is that we need a way to identify nodes from contracts so we don't run them within our project's scope (unless given permission to do so by the upstream project).
- `share` - This would be used by whatever functions we develop to create the JSON artifact an upstream project produces.
- `is_contract` - Pretty self-explanatory.
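As a dataclass sketch, under the assumption these flags would live on a config class — the field names come from the list above, and none of them exist in dbt today:

```python
from dataclasses import dataclass

@dataclass
class ContractNodeConfig:
    """Hypothetical extra config fields for nodes pulled in from an
    upstream project's contract artifact."""
    execute: bool = True       # False for contract nodes so they never run locally
    share: bool = False        # opt-in flag the producer uses when exporting
    is_contract: bool = False  # marks a node as originating upstream

# A node materialized from an upstream contract might look like:
contract_cfg = ContractNodeConfig(execute=False, is_contract=True)
```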
I think the downstream impacts of this approach are much greater than the initial implementation proposed using a `ConsumerLookup`, but it may make more sense.