OpenUSD Variant sets on the specialized prims aren’t being expanded properly

Description of Issue

When Specializing a prim from within a Variant, other Variants on the Specialized prim itself are being ignored.

Originally reported here: https://forum.aousd.org/t/quirk-with-variants-specializes/1787

Steps to Reproduce

Given the following layer:

#usda 1.0
class "_class_"
{
    class "asset"
    {
        class "representations"
        {
            class "modelHero" (
                variants = {
                    string lookdev = "v2"
                }
                prepend variantSets = "lookdev"
            )
            {
                def "geo"
                {
                    def "foo"
                    {
                    }
                }
                variantSet "lookdev" = {
                    "v2" {
                        def "mtl"
                        {
                            def Material "myMaterial"
                            {
                            }
                        }

                    }
                }
            }
        }
    }
}

def Xform "element" (
    variants = {
        string mode = "foo"
    }
    prepend variantSets = "mode"
)
{
    variantSet "mode" = {
        "foo" (
            prepend specializes = </_class_/asset/representations/modelHero>
        ) {

        }
    }
}

In the composed stage, the modelHero prim has the mtl/myMaterial prim as expected but the element prim does not.

Other observations of note:

Using Inherits instead of Specializes gives the expected result
Adding an indirection between element and modelHero by specializing a different prim which then specializes modelHero gives the expected result
Specializing the representations parent prim rather than modelHero gives the expected result, albeit with an extra representations prim underneath the element

System Information (OS, Hardware)

Ubuntu 22.04

Package Versions

USD-23.08

Build Flags

Aug 21 '24 22:08 tymonpitts

Filed as internal issue #USD-10010

Aug 22 '24 16:08 jesschimein

I've run into another example of what I believe is this same bug but the observed behaviour is slightly different. In the original example the contents of the variant are lost while content from other specs are present. In the following example, the contents of the variant are present but the content of a specializing prim are lost:

#usda 1.0

def "Tree"
{
    class "representations"
    {
        def "modelHero"
        {
            def "geo"
            {
                over "leaves"
                {
                    def "Prototypes" (
                        prepend specializes = </Foliage/render>
                    )
                    {
                    }
                }
            }
        }
    }

    def "render" (
        prepend specializes = </Tree/representations/modelHero>
    )
    {
    }
}

def "Foliage" (
    prepend references = </FoliageImpl>
)
{
}

class "FoliageImpl"
{
    class "representations"
    {
        def "modelHero" (
            variants = {
                string testVariantSet = "testVariant"
            }
            prepend variantSets = "testVariantSet"
        )
        {
            variantSet "testVariantSet" = {
                "testVariant" (
                    customData = {
                        string testVariantMetadata = "variantValue"
                    }
                )
                {
                    uniform string testVariantAttr = "variantValue"

                    def "testVariantChild"
                    {
                    }
                }
            }
        }
    }

    def "render" (
        customData = {
            string testRenderPrimMetadata = "renderPrimValue"
        }
        prepend specializes = </FoliageImpl/representations/modelHero>
    )
    {
        uniform string testRenderPrimAttr = "renderPrimValue"

        def "testRenderPrimChild"
        {
        }
    }
}

In the composed stage I'm expecting /Tree/render/geo/leaves/Prototypes to have:

testRenderPrimChild child prim
testVariantChild child prim
testRenderPrimAttr attribute set to "renderPrimValue"
testVariantAttr attribute set to "variantValue"
testRenderPrimMetadata entry in the customData set to "renderPrimValue"
testVariantMetadata entry in the customData set to "variantValue"

However what I see is only:

testVariantChild child prim
testVariantAttr attribute set to "variantValue"
testVariantMetadata entry in the customData set to "variantValue"

All data authored on /FoliageImpl/render is lost on /Tree/render/geo/leaves/Prototypes but is present on /Tree/representations/modelHero/geo/leaves/Prototypes which /Tree/render/geo/leaves/Prototypes is specializing.

I did try to simplify this repro case even more but interestingly if you change almost anything about the composition structure then it starts working as expected. For example, flattening either the Foliage -> FoliageImpl reference arc or the /FoliageImpl/render -> /FoliageImpl/representations/modelHero specialization then the scene composes as expected.

Sep 11 '24 03:09 tymonpitts

And here's a variation of the original example with slightly different composition structure:

#usda 1.0
class "_class_asset"
{
    class "representations"
    {
        def "modelHero" (
            prepend specializes = </_class_defaultVariantSelections>
            prepend variantSets = "geometryMode"
        )
        {
            variantSet "geometryMode" = {
                "usd" {
                    def "geo"
                    {
                    }

                }
            }
        }
    }

    def "render" (
        prepend specializes = </_class_asset/representations/modelHero>
    )
    {
    }
}

class "_class_defaultVariantSelections" (
    prepend variantSets = "geometryMode"
    variants = {
        string geometryMode = "usd"
    }
)
{
    variantSet "geometryMode" = {
        "usd" {

        }
    }
}

def Xform "Element" (
    prepend specializes = </_class_asset>
)
{
}

In this case I'm expecting /Element/render/geo to exist but it does not. If you do the variant selection directly on /_class_asset/representations/modelHero it still doesn't work but if you also remove the /_class_defaultVariantSelections class completely then it works as expected. You can also Inherit or Reference /_class_defaultVariantSelections rather than Specialize and it works as expected.

Sep 11 '24 03:09 tymonpitts

These are all simplified examples of issues we've run into in production use cases. Thankfully we've discovered these cases before widespread rollout of our planned switch from Inherits to Specializes in some use cases. Due to this bug we are re-evaluating our options (Inherits has issues for our workflows as well which prompted the switch).

Do you have an ETA on when this issue will be looked into? Currently Specializes is very unstable but we might risk using it in a select few cases where it doesn't currently present with this bug. Having a rough estimate on a timeline for a fix will help us make that decision.

Sep 11 '24 03:09 tymonpitts

Hi @tymonpitts , what I can share right now is that it is a project we anticipate starting when the work to deboostify our python wrappings completes, which will likely be early next year. The project to rework specializes will not be trivial, and has not been bid yet. So, we hope summer/fall 2025, but cannot promise it, as production priorities shift.

Thanks for your patience.

Sep 12 '24 00:09 spiffmon

This bug is now blocking some workflows we're trying to implement so I've taken a deeper look into the relevant code in primIndex.cpp to see if we can fix the immediate issue ourselves and submit a pull request. I've discovered that there's actually 2 bugs at play here. The original repro case is caused by variant nodes missing from the prim index all together. The other repro cases have the appropriate nodes but both the origin node AND the propagated node are marked inert.

(Note all of my debugging has been done with 23.08 since our studio can't yet build later versions internally but I presume the results would be similar enough since the bug(s) were confirmed here as still present in later versions as well)

Bug 1: Missing Variant Nodes

I've simplified the repro case even further to rule out ancestral opinions as well as just make it easier to debug:

#usda 1.0
class "A" (
    variants = {
        string nestedVariantSet = "nestedVariant"
    }
    prepend variantSets = "nestedVariantSet"
)
{
    variantSet "nestedVariantSet" = {
        "nestedVariant" {
            custom string test = "nestedVariant"
        }
    }
}

def "B" (
    variants = {
        string introducingVariantSet = "introducingVariant"
    }
    prepend variantSets = "introducingVariantSet"
)
{
    variantSet "introducingVariantSet" = {
        "introducingVariant" (
            prepend specializes = </A>
        ) {
        }
    }
}

This is the prim index dot graph for /B which you can see is missing the variant child nodes under both the origin and propagated specialize nodes:

I identified the following sequence of events that lead to that bugged behaviour:

node (2) is created with shouldContributeSpecs=true (inert=false)
EvalNodeVariantSets task for node (2) is added
node (3) is created with shouldContributeSpecs=false (inert=true)
- note that the inert state of node (2) and (3) are currently inverted from what they will be in the final graph
EvalImpliedSpecializes task for node (3) is added
- note that no EvalNodeVariantSets task is added for node (3) because it is currently inert
since tasks are sorted according to a hardcoded priority order, the EvalImpliedSpecializes task for node (3) runs before the EvalNodeVariantSets task for node (2):
- propagates child nodes from node (2) to node (3) (there are none yet in this case)
- marks node (3) with inert=false
- marks node (2) with inert=true
EvalNodeVariantSets task for node (2) runs which does nothing because the node is now inert
Since node (3) didn't get an EvalNodeVariantSets task because it was initially inert and node (2)'s EvalNodeVariantSets did nothing because it was marked inert before the task ran, we end up with no variant nodes created.

This sequence of events also explains why the bug is not triggered when you remove the introducingVariantSet and just specialize directly. In that case there is no need to propagate the specialize node to the root of the graph because it's already at the root so there is no toggling of inert states and EvalNodeVariantSets runs just fine.

My proposed fix is when node (3) is created, in _AddTasksForNodeRecursively I added a check for this case and add an EvalNodeVariantSets task for node (3) even though the node is currently inert:

Index: pxr/usd/pcp/primIndex.cpp
<+>UTF-8
===================================================================
diff --git a/pxr/usd/pcp/primIndex.cpp b/pxr/usd/pcp/primIndex.cpp
--- a/pxr/usd/pcp/primIndex.cpp (revision c64d776eb3d71a720a2faebe43000362f531f3a3)
+++ b/pxr/usd/pcp/primIndex.cpp (revision e35d97e22518d4f4e61c8ff934bda85bc961fd13)
@@ -844,6 +844,10 @@
     TF_ADD_ENUM_NAME(Task::None);
 }
 
+static bool
+_IsPropagatedSpecializesNode(
+    const PcpNodeRef& node);
+
 // Pcp_PrimIndexer is used during prim cache population to track which
 // tasks remain to finish building the graph.  As new nodes are added,
 // we add task entries to this structure, which ensures that we
@@ -992,13 +996,33 @@
         // If the node does not have specs or cannot contribute specs,
         // we can avoid even enqueueing certain kinds of tasks that will
         // end up being no-ops.
-        const bool contributesSpecs = n.HasSpecs() && n.CanContributeSpecs();
+        //
+        // Note that even when CanContributeSpecs is false, we still need to
+        // add EvalNodeVariantSets tasks in some cases due to the way node
+        // inert states are toggled during task evaluation.
+        //
+        // The node might be marked as inert right now but maybe activated
+        // during EvalImpliedSpecializes. If that happens and we haven't added
+        // a EvalNodeVariantSets task here then specialize nodes can end up
+        // with missing variant nodes. See testPcpRegressionBugs_USD-10010
+        // test case
+        const bool contributesAllSpecs = n.HasSpecs() && n.CanContributeSpecs();
+        const bool contributesJustVariantSpecs =
+            !contributesAllSpecs &&
+            _IsPropagatedSpecializesNode(n) &&
+            n.IsInert() &&
+            n.HasSpecs();
 
         // Preflight scan for arc types that are present in specs.
         // This reduces pressure on the task queue, and enables more
         // data access locality, since we avoid interleaving tasks that
         // re-visit sites later only to determine there is no work to do.
-        const size_t arcMask = contributesSpecs ? _ScanArcs(n) : 0;
+        size_t arcMask = 0;
+        if (contributesAllSpecs) {
+            arcMask = _ScanArcs(n);
+        } else if (contributesJustVariantSpecs) {
+            arcMask = _ScanArcs(n) & _ArcFlagVariants;
+        }
 
         // If the caller tells us the new node and its children were already
         // indexed, we do not need to re-scan them for certain arcs based on
@@ -3107,10 +3131,6 @@
     }
 }
 
-static bool
-_IsPropagatedSpecializesNode(
-    const PcpNodeRef& node);
-
 static void
 _EvalImpliedClasses(
     PcpPrimIndex *index,

The EvalNodeVariantSets for node (2) still does nothing but then EvalNodeVariantSets for node (3) runs and since node (3) was set as inert=false during EvalImpliedSpecializes it actually adds the appropriate node for the variant. That variant node is then propagated up to node (2) and results in the following graph:

I'm reasonably confident that the fix will not introduce undesired effects because the conditions are pretty specific and even if those conditions are met in non-bugged cases, all I'm doing is adding an EvalNodeVariantSets task to be run later. From what I can tell, all of these tasks are designed to do nothing if they don't need to. You just add them if something might need to be evaluated and the only reason they are not added all the time is simply for performance.

Thoughts?

Bug 2: Both origin and propagated nodes become inert

(Should I submit a separate ticket for this one?)

Similar to the first bug, I've simplified the repro cases to aid debugging.

Case 1

#usda 1.0

def "element" (
    prepend specializes = </_element_class>
)
{
}

class "_element_class" (
    prepend specializes = </middleman>
)
{
}

class "middleman" (
    prepend references = </referencedMiddleman>
)
{
}

class "referencedMiddleman" (
    prepend specializes = </implementation>
)
{
    uniform string referencedAttr = "referencedValue"
}

class "implementation" (
    variants = {
        string testVariantSet = "testVariant"
    }
    prepend variantSets = "testVariantSet"
)
{
    variantSet "testVariantSet" = {
        "testVariant"
        {
            uniform string variantAttr = "variantValue"
        }
    }
}

Gives the following graph:

Where you can see nodes (3) and (7) are both marked inert and so the opinions expressed directly on /referencedMiddleman do nothing.

Case 2

#usda 1.0
def "implementation" (
    prepend specializes = </_class_defaultVariantSelections>
    prepend variantSets = "testVariantSet"
)
{
    variantSet "testVariantSet" = {
        "testVariant" {
            custom string variantAttr = "hello world"
        }
    }
}

def "middleman" (
    prepend specializes = </implementation>
)
{
}

class "_class_defaultVariantSelections" (
    prepend variantSets = "testVariantSet"
    variants = {
        string testVariantSet = "testVariant"
    }
)
{
    variantSet "testVariantSet" = {
        "testVariant" {

        }
    }
}

def "element" (
    prepend specializes = </middleman>
)
{
}

Gives the following graph:

Where you can see nodes (3) and (7) are both marked inert and so the opinions expressed in that variant do nothing. I've identified the following sequence of events that leads to this bugged behaviour (node numbers below refer to the nodes in Case 2's graph):

Node (1) is created with EvalNodeSpecializes and EvalImpliedSpecializes tasks
EvalNodeSpecializes task runs for Node (1) and creates Node (2) which then adds an EvalNodeSpecializes task for the new node
EvalNodeSpecializes task runs for Node (2) and creates Node (4)
EvalImpliedSpecializes task runs for Node (1):
- propagates Node (2) to the root as Node (6) and
- adds EvalVariantSets task to Node (6)
- marks Node (6) as inert=false
- marks Node (2) as inert=true
- propagates Node (4) to the root as Node (8) and:
- adds EvalVariantSets task to Node (8)
EvalNodeVariantSets task runs for Node (6) and adds an EvalNodeVariantAuthored task for it
EvalNodeVariantSets task runs for Node (8) and adds an EvalNodeVariantAuthored task for it
EvalNodeVariantAuthored task runs for Node (6) and creates Node (7) with an EvalImpliedSpecializes task
EvalImpliedSpecializes task runs for Node (7):
- propagates Node (7) back up to it's origin as Node (3):
- marks Node (3) as inert=false
- marks Node (7) as inert=true
- adds an EvalImpliedSpecializes task for Node (1)
EvalImpliedSpecializes task runs for Node (1)
- marks Node (7) as inert=false
- marks Node (3) as inert=true
EvalNodeVariantAuthored task runs for Node (8) and creates Node (9) and adds an EvalImpliedSpecializes task to Node (8)
EvalImpliedSpecializes task runs for Node (8):
- propagates Node (9) back to it's origin as Node (5)
- adds an EvalImpliedSpecializes task for Node (1)
EvalImpliedSpecializes task runs for Node (1)
- marks Node (7) as inert=true
- marks Node (3) as inert=true

Note that the last step there marked both nodes as inert=true which seems wrong. The problem is this code in _PropagateNodeToParent:

            newNode.SetInert(srcNode.IsInert());
            newNode.SetHasSymmetry(srcNode.HasSymmetry());
            newNode.SetPermission(srcNode.GetPermission());
            newNode.SetRestricted(srcNode.IsRestricted());

            srcNode.SetInert(true);

When srcNode is already inert this will make both nodes inert which is what happens in that last step.

You can also see in the series of events why removing the /_class_defaultVariantSelection prim avoids the bug. It's the last EvalImpliedSpecializes task that gets added because it's propagating the variant node for /_class_defaultVariantSelection back to it's origin node. Without that task being added again the inert states for nodes (7) and (3) are left alone.

The naive solution is to toggle the inert state of srcNode rather than hardcoding it to srcNode.SetInert(true). I tried that though and while it works in this specific simplified case, on a more complex case it ends up making the origin node active rather than the propagated root node so the strength ordering is incorrect.

The issue seems to be that _PropagateNodeToParent gets called in 2 different contexts: _PropagateSpecializesTreeToRoot and _PropagateArcsToOrigin. I'm thinking that instead of infering the inert state of newNode from srcNode I can add an argument to _PropagateNodeToParent which describes which node should be inert and which active. _PropagateArcsToOrigin would always make srcNode active and _PropagateSpecializesTreeToRoot would always make newNode active. This would ensure that the implied node is always active and the origin node inert.

Thoughts? Is it safe to assume that for the Specialize arc, the implied node should always be active while the origin is inert? Or are there cases that I'm not considering? Do ancestral opinions throw a wrench in that?

Feb 05 '25 02:02 tymonpitts

@tymonpitts Hi Tymon, sorry for the late reply. This is fantastic digging, thank you so much for the detailed analysis!

As it so happens I am currently in progress on a rewrite of the specializes arc handling in Pcp that avoids the complicated propagation of subtrees around the prim index. I'm hopeful that this will "just work" and resolve the issues you're seeing -- at the very least the second bug you've noted, since that seems directly related to the propagation steps I'm removing.

I'm tracking this issue as part of my rewrite. I'll circle back around to verify my changes fix these issues, or work on a fix if they don't. Thank you again!

Mar 11 '25 17:03 sunyab

@sunyab that's great to hear that you're working on getting rid of the subtree propagation! I found that to be a huge headache especially with how it toggles inert states and propagates in both directions. It's very brittle currently.

I've actually already implemented my proposed fixes on our internal 23.08 branch that we're using in production since we likely won't be able to switch to 25.x for a while. I was wondering if I could still submit a PR just for you to review my changes and let me know if you think it will cause issues for us in production? The second bug ended up being a little more complicated to fix. All of the tests are passing and I also compared the Pcp graph before and after my changes on a full shot with no issues but I'm curious if you would have any insights about anything I may have broken.

Mar 11 '25 21:03 tymonpitts

@tymonpitts Sure, I can try to take a look and run the changes through our internal test suite to see if anything falls out. But just to manage expectations, the specializes handling is tricky enough (as you're well aware) that I wouldn't be super confident I'd notice anything from code inspection.

Mar 17 '25 05:03 sunyab

Thanks @sunyab, I've gotten a commit pushed to our fork that fixes both bugs: https://github.com/PixarAnimationStudios/OpenUSD/commit/d7697f1f041ff056b49a9513279722e810df5fe0#r153878935

I wasn't able to submit a Draft PR for you to review (Pull request creation failed. Validation failed: must be a collaborator) but I've made some comments directly on the commit and tagged you in them. Hopefully that's good enough but please let know otherwise and perhaps we can get the "must be a collaborator" issue sorted.

Mar 18 '25 03:03 tymonpitts

@tymonpitts I posted a few comments on that commit, but tl;dr: the changes seem good to me, our internal test suite didn't run into any issues with the changes and performance tests also looked good. Not a guarantee there isn't a weird corner out there that might have problems, but hopefully good enough for your needs.

I've verified your test cases all work as expected with the rewritten specializes handling in my dev tree. I'll be adding these test cases to the repo as part of my changes too.

Mar 24 '25 17:03 sunyab

ok awesome, that's good to hear! We'll move forward with these changes internally and I'm glad the test cases will be useful for you. Thanks for taking a look!

Mar 24 '25 21:03 tymonpitts