Overhaul data model of dependency graphs for better query performance and relationship type support
Current Behavior
How Dependency Graphs Work Today
Both the PROJECT and COMPONENT object have a DIRECT_DEPENDENCIES column, which contains a JSON array of serialized ComponentIdentity objects.
This roughly resembles how dependencies are represented in CycloneDX v1.5 and earlier.
While this approach works, it has a few downsides:
- Referential integrity can't be verified by the RDBMS. Since IDs are encoded in JSON, foreign keys can't be utilized.
- Poor query performance. Traversing the graph requires expensive
LIKEconditions such as"DIRECT_DEPENDENCIES" LIKE ('%' || :childComponentUuid || '%').- Per default, DT does not create a working index on the
DIRECT_DEPENDENCIEScolumn - A
LIKEcondition with wildcards on both ends requires a special index, such as GIN in PostgreSQL, which in turn requires an extension that is not enabled per default.
- Per default, DT does not create a working index on the
- It bloats table sizes and REST API responses, since JSON arrays can become quite large for projects with busy dependency graphs.
To give an example of how traversal queries look like in the current data model, here's one from Hyades that identifies whether a specific component (identified by the parameter :leadComponentUuid) is introduced through another component with the name foo:
Example Query
-- Determine the project the given leaf component is part of.
WITH RECURSIVE
"CTE_PROJECT" AS (
SELECT
"PROJECT_ID" AS "ID"
FROM
"COMPONENT"
WHERE
"UUID" = :leafComponentUuid
),
-- Identify the IDs of all components in the project that
-- match the desired criteria.
"CTE_MATCHES" AS (
SELECT
"ID"
FROM
"COMPONENT"
WHERE
"PROJECT_ID" = (SELECT "ID" FROM "CTE_PROJECT")
-- Do not consider other leaf nodes (typically the majority of components).
-- Because we're looking for parent nodes, they MUST have direct dependencies defined.
AND "DIRECT_DEPENDENCIES" IS NOT NULL
AND ("NAME" = 'foo') -- NB: These could be arbitrarily many conditions
),
"CTE_DEPENDENCIES" ("UUID", "PROJECT_ID", "FOUND", "PATH") AS (
SELECT
"C"."UUID" AS "UUID",
"C"."PROJECT_ID" AS "PROJECT_ID",
("C"."ID" = ANY(SELECT "ID" FROM "CTE_MATCHES")) AS "FOUND",
ARRAY ["C"."ID"]::BIGINT[] AS "PATH"
FROM
"COMPONENT" AS "C"
WHERE
-- Short-circuit the recursive query if we don't have any matches at all.
EXISTS(SELECT 1 FROM "CTE_MATCHES")
-- Otherwise, find components of which the given leaf component is a direct dependency.
AND "C"."DIRECT_DEPENDENCIES" LIKE ('%' || :leafComponentUuid || '%')
UNION ALL
SELECT
"C"."UUID" AS "UUID",
"C"."PROJECT_ID" AS "PROJECT_ID",
("C"."ID" = ANY(SELECT "ID" FROM "CTE_MATCHES")) AS "FOUND",
ARRAY_APPEND("PREVIOUS"."PATH", "C"."ID") AS "PATH"
FROM
"COMPONENT" AS "C"
INNER JOIN
"CTE_DEPENDENCIES" AS "PREVIOUS" ON "PREVIOUS"."PROJECT_ID" = "C"."PROJECT_ID"
WHERE
-- If the previous row was a match already, we're done.
NOT "PREVIOUS"."FOUND"
-- Also, ensure we haven't seen this component before, to prevent cycles.
AND NOT ("C"."ID" = ANY("PREVIOUS"."PATH"))
-- Otherwise, the previous component must appear in the current direct dependencies.
AND "C"."DIRECT_DEPENDENCIES" LIKE ('%' || "PREVIOUS"."UUID" || '%')
)
SELECT BOOL_OR("FOUND") FROM "CTE_DEPENDENCIES";
CycloneDX v1.6
CycloneDX v1.6 will introduce the provides attribute to its Dependency model (source):
message Dependency {
// References a component or service by the its bom-ref attribute
string ref = 1;
// The bom-ref identifiers of the components or services that are dependencies of this dependency object.
repeated Dependency dependencies = 2;
// The bom-ref identifiers of the components or services that define a given specification or standard, which are provided or implemented by this dependency object.
repeated string provides = 3;
}
In order to support this, and potential future additions to relationships, we need to rework how we handle dependency graphs.
Proposed Behavior
I propose to revisit how we store and traverse dependency graphs.
Ideally, the graph structure would live in separate tables, and would be able to refer to multiple object types, such as Component, Service, Data, etc.
Instead of JSON, more efficient data types should be used, that are trivial to index and allow for referential integrity verification.
Just to put something out there:
CREATE TABLE "DEPENDENCY_GRAPH_NODE" (
"ID" BIGSERIAL PRIMARY KEY,
"PROJECT_ID" BIGINT REFERENCES "PROJECT" ("ID") NULL,
"COMPONENT_ID" BIGINT REFERENCES "COMPONENT" ("ID") NULL,
"SERVICE_ID" BIGINT REFERENCES "SERVICECOMPONENT" ("ID") NULL,
-- Ensure that exactly one reference is provided.
CHECK (
("PROJECT_ID" IS NOT NULL)::INT
+ ("COMPONENT_ID" IS NOT NULL)::INT
+ ("SERVICE_ID" IS NOT NULL)::INT = 1
)
);
CREATE TABLE "DEPENDENCY_GRAPH_EDGE" (
"FROM" BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
"TO" BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
"TYPE" TEXT NOT NULL,
CHECK ("TYPE" = ANY(ARRAY['DEPENDS_ON', 'PROVIDES']))
);
Or a less strict variant using JSON to store properties:
CREATE TABLE "DEPENDENCY_GRAPH_NODE" (
"ID" BIGSERIAL PRIMARY KEY,
"PROPERTIES" JSON NOT NULL
);
CREATE TABLE "DEPENDENCY_GRAPH_EDGE" (
"FROM" BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
"TO" BIGINT REFERENCES "DEPENDENCY_GRAPH_NODE" ("ID") NOT NULL,
"PROPERTIES" JSON NULL
);
However, JSON support and the ability to index such columns varies wildly between RDBMSes.
We need to ensure that whatever data model we end up using, works well with the queries we're going to execute.
In the best case, the new graph structure would allow to also represent relationships between projects.
Checklist
- [X] I have read and understand the contributing guidelines
- [X] I have checked the existing issues for whether this enhancement was already requested