
Node Unresolved Conflicts and Doc Clarifications

Open dscirto opened this issue 9 years ago • 18 comments

We are trying to improve our AEM development practices. The first step of that for us is to determine the best path for an initial sync to local development environments. The biggest barrier has been, predictably, the time required to sync "/content" and, more specifically, "/content/dam."

What I've done so far is install Grabbit v3.0.5 on our AEM 6.0 SP3 server and v3.0.5 on a clean, local AEM 6.2. I don't know whether the versions need to match (I've had mixed results), but that may be a secondary concern. For POC purposes, I am using the Grabbit shell, knowing that it's just curl-ing the servlets.

I have two issues:

  1. For this run, I am disabling all workflows through JCR properties instead of specifying them in the configuration JSON (a rough curl sketch follows the config below). I am using a very basic config file (sensitive information replaced):
{
    "serverUsername" : "****",
    "serverPassword" : "****",
    "serverHost" : "someserver",
    "serverPort" : "8081",
    "deltaContent" : true,
    "pathConfigurations" :  [
        {
            "path" : "/content/dam"
        }
    ]
}
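For reference, since the Grabbit shell is really just curl-ing the servlets, the run boils down to something like the following. The launcher "enabled" property and the /grabbit/job resource reflect my best understanding for this version; hosts, ports, and credentials are placeholders:

# Disable one of the DAM workflow launchers on the client by flipping its
# "enabled" property via the Sling POST servlet (repeated per launcher).
curl -u admin:admin -F "enabled=false" -F "enabled@TypeHint=Boolean" \
  "http://localhost:4502/etc/workflow/launcher/config/update_asset_mod"

# Kick off the sync by sending the config JSON to the Grabbit client servlet.
curl -u admin:admin -X PUT -H "Content-Type: application/json" \
  -d @config.json "http://localhost:4502/grabbit/job"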

What I find during this run is that I receive no errors while syncing, yet the job seems to end prematurely with a "COMPLETED" status. Is this expected behavior based on "deltaContent" or some batch/job/other configurations in the Grabbit code?

  2. For this run, I have all out-of-the-box AEM 6.2 workflows enabled via JCR properties. The configuration JSON specifies workflows to skip, based on the documentation:
{
    "serverUsername" : "****",
    "serverPassword" : "****",
    "serverHost" : "someserver",
    "serverPort" : "8081",
    "deltaContent" : true,
    "pathConfigurations" :  [
        {
            "path" : "/content/dam",
            "workflowConfigIds" :
                [
                    "/etc/workflow/launcher/config/update_asset_mod",
                    "/etc/workflow/launcher/config/update_asset_create",
                    "/etc/workflow/launcher/config/dam_xmp_writeback"
                ]
        }
    ]
}
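(For anyone following along: the launcher IDs that go into workflowConfigIds can be enumerated straight from the repository with the Sling GET servlet's JSON rendering; host and credentials are placeholders.)

# List the workflow launcher config nodes one level deep; each child name is a
# candidate "/etc/workflow/launcher/config/<name>" entry for workflowConfigIds.
curl -u admin:admin "http://localhost:4502/etc/workflow/launcher/config.1.json"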

What I've noticed with this configuration is a couple of things... First, I'm knowingly not ignoring as many workflows as I would really want to (e.g. the workflow that pulls images from .doc files). The issue with this configuration is that I often get JCR exceptions of "Unresolved conflict at /content/xyz/etc..." I'm having trouble recreating the stack trace as I write this out and will update when I can do so. Part of the issue is that the behavior is not predictable. My gut tells me it's an issue of a DAM asset being added, a workflow doing something like creating renditions or parsing images, and then trying to save to the JCR while Grabbit's session hasn't yet saved its batch to the JCR.

As I get reproducible stack traces and errors, I will update the issue. Not sure if these are config issues, workflow issues, or v3.0.5 issues. Happy to provide any additional details as well!

dscirto, Oct 05 '16

Hey @dscirto!

I'm actually super surprised you were able to install Grabbit 3 on an AEM 6.2 instance. I've never tried, and I haven't heard of anyone else who has tried either. I would be careful on that - it may produce some undefined behavior. https://github.com/TWCable/grabbit/blob/master/docs/AEMSupport.adoc defines our support for various versions of AEM. Grabbit support for AEM 6.0 was fairly short lived, so it may be a bit better; but it hasn't gotten a ton of attention.

In general, we support any configuration where servers share the same major Grabbit version, on the same AEM platform. Grabbit has been known to be used for syncing between AEM versions and between certain Grabbit versions (e.g. CQ5.6 -> AEM 6.1 w/ Grabbit 3 -> Grabbit 4), but that is sort of off-the-record, because it comes with big gotchas. The biggest of these are node definition incompatibilities between AEM versions (some I believe are with DAM content); Grabbit does not have any sort of data transformation behavior; it's simply a data sync tool.

For number 1, did the job status say it wrote any nodes? It is certainly possible that, if you had synced that path previously, the delta sync option would result in quick completion. If you think something abnormal is going on, I would take a look at the logs and see if anything looks suspicious.
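If it helps, the job status can also be polled over HTTP. The exact status resource has moved around between Grabbit versions, so treat these paths as approximations:

# Status of all Grabbit client jobs.
curl -u admin:admin "http://localhost:4502/grabbit/job/all.json"

# Status of a single job by its jobExecutionId.
curl -u admin:admin "http://localhost:4502/grabbit/job/<jobExecutionId>.json"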

jbornemann, Oct 11 '16

Hey, thanks for the response!

As far as I can tell, minus the issues above, using a previous Grabbit version on a newer AEM 6.2 instance hasn't run into any problems! I did run into issues with deprecated JCR node types -- but we have deprecated our use of them, so I'm just manually ignoring those paths. No harm done.

For number 1, all of the output points to a successful sync. I'm going to spin up a fresh AEM 6.2 instance (to be synced from AEM 6.0, as described previously) and turn on DEBUG log level for Grabbit and post what I find.

While that runs, I'll expand on the issue. Grabbit seems to traverse the JCR at "content/dam" normally with no exceptions. I see normal log output for assets being added properly up until the same point of the JCR (arbitrarily "content/dam/m"). At this point (and always at the same point), I get log output that the job has successfully finished with a "COMPLETED" status. The problem is that it doesn't sync "content/dam/n-z," which doesn't seem to be expected behavior.

What's funnier is that, right after the "successful" sync, I can run the exact same job. I get minimal output, which I expect because it doesn't need to sync "content/dam/a-m." Refreshing the JCR in the target AEM, however, shows the missing "content/dam/n-z" JCR nodes. Those nodes have no content in them (just the JCR tree structure), and there is no log output.

So number 1 is really two separate issues... First, the job is abending (for lack of a better term) but returning a successful status. Second, a second call to the same job seems to write the skeleton for the missing JCR nodes, but not the content. I suspect "deltaContent" may be a factor in the second issue, but not the first.

Once this sync with debugging wraps up, I'll post some logs and maybe we can get to the bottom of it.

dscirto, Oct 11 '16

Here's some follow-up information.

Sync run from AEM 6.0 => AEM 6.2, both running Grabbit v3.0.5. All workflows on the AEM 6.2 instance are disabled via JCR properties.

config.json

{
    "serverUsername" : "****",
    "serverPassword" : "****",
    "serverHost" : "****",
    "serverPort" : "8081",
    "pathConfigurations" : [
        {
            "path" : "/content/dam"
        }
    ]
}

Job status upon completion

{
    "jobExecutionId": 5868304876744243000,
    "jcrNodesWritten": 64157,
    "exitStatus": {
        "exitDescription": "",
        "exitCode": "COMPLETED",
        "running": false
    },
    "endTime": "2016-10-11T14:58:46+0000",
    "timeTaken": 607566,
    "path": "/content/dam",
    "startTime": "2016-10-11T14:48:39+0000"
}

As I expected, the job finished at the same JCR node as I mentioned before and did not traverse the rest of the tree. Note that this time I removed "deltaContent" to see if it was possibly miscalculating whether or not to sync these missing nodes. Unfortunately, I did not see anything worthwhile in the DEBUG logs on either AEM server.

I ran the exact same job again to see if, as I described above, a second run would populate a skeleton of the missing JCR nodes. What I found is that, probably because I removed "deltaContent," it simply repeated the behavior of the first run.

{ "jobExecutionId": 4247691261017628700, "jcrNodesWritten": 64157, "exitStatus": { "exitDescription": "", "exitCode": "COMPLETED", "running": false }, "endTime": "2016-10-11T15:36:20+0000", "timeTaken": 524495, "path": "/content/dam", "startTime": "2016-10-11T15:27:35+0000" }

I ran the job a third time, with "deltaContent" added only for this third run. This time, I got what I expected -- the missing JCR nodes are now present, but without the actual content. Here is the resulting job status:

{ "jobExecutionId": 4220323512116302000, "jcrNodesWritten": 29548, "exitStatus": { "exitDescription": "", "exitCode": "COMPLETED", "running": false }, "endTime": "2016-10-11T17:30:04+0000", "timeTaken": 92767, "path": "/content/dam", "startTime": "2016-10-11T17:28:31+0000" }

dscirto, Oct 11 '16

I decided to cherry pick a couple of files and look through the logs with DEBUG on -- one file that came in from the successful runs (1) and one file that had the JCR structure but no content (2):

  1. nav_aboutus2.jpg
  2. Google.pdf

What I found is that files that had their content synced successfully had two outputs in the log files (ellipses manually used to obfuscate JCR structure):

*DEBUG* [clientJobLauncherTaskExecutor-1] com.twcable.grabbit.jcr.ProtoNodeDecorator Added mixin mix:versionable for : nav_aboutus2.jpg

and

*INFO* [pool-6-thread-5] com.adobe.cq.creativecloud.filesync.impl.events.SyncEventGenerator Asset added at /content/dam/.../nav_aboutus2.jpg

However, "Google.pdf" is absent from the logs entirely.

dscirto, Oct 11 '16

"While that runs, I'll expand on the issue. Grabbit seems to traverse the JCR at "content/dam" normally with no exceptions. I see normal log output for assets being added properly up until the same point of the JCR (arbitrarily "content/dam/m"). At this point (and always at the same point), I get log output that the job has successfully finished with a "COMPLETED" status. The problem is that it doesn't sync "content/dam/n-z," which doesn't seem to be expected behavior.

What's funnier is that, right after the "successful" sync, I can run the exact same job. I get minimal output, which I expect because it doesn't need to sync "content/dam/a-m." Refreshing the JCR in the target AEM, however, shows the missing "content/dam/n-z" JCR nodes. Those nodes, however, have no content in them (just the JCR tree structure) and there is no log output"

I would try just syncing the path in question to see what happens. It could be that after the first run, the session just wasn't done persisting changes. In some configurations, you can find statements in the log like "[session-149] Saving 100 nodes..", etc. If that's not it, I would be more suspicious of the AEM 6.2 side of things (maybe a bad node definition, etc.) than of delta sync, just because delta sync is pretty straightforward functionality and pretty well vetted at this point. But if there is a bug in delta sync, we definitely want to know about it! Delta sync is designed to be pessimistic; in other words, if it can't determine that a path is already up-to-date, it syncs it anyway.

"As I expected, the job finished at the same JCR node as I mentioned before and did not traverse the rest of the tree. Note that this time I removed "deltaContent" to see if it was possibly miscalculating whether or not to sync these missing nodes. Unfortunately, I did not see anything worthwhile in the DEBUG logs on either AEM server.

I ran the exact same job again to see if, as I described above, a second run would populate a skeleton of the missing JCR nodes. What I found is that, probably because I removed "deltaContent," it simply repeated the behavior of the first run."

Cool, yeah, without delta sync it should send all of the nodes across again, including previously synced ones.

"I decided to cherry pick a couple of files and look through the logs with DEBUG on -- one file that came in from the successful runs (1) and one file that had the JCR structure but no content (2):

  1. nav_aboutus2.jpg
  2. Google.pdf

What I found is that files that had their content synced successfully had two outputs in the log files (ellipses manually used to obfuscate JCR structure):

DEBUG [clientJobLauncherTaskExecutor-1] com.twcable.grabbit.jcr.ProtoNodeDecorator Added mixin mix:versionable for : nav_aboutus2.jpg. and INFO [pool-6-thread-5] com.adobe.cq.creativecloud.filesync.impl.events.SyncEventGenerator Asset added at /content/dam/.../nav_aboutus2.jpg.

However, "Google.pdf" is absent from the logs entirely."

I don't think we log out all of the nodes we sync, but I could be wrong - too much log to say for sure. You lost me a bit with the note about one having no content. Are you saying an asset node synced without its binary data?

If you see what appear to be strange things happening for certain paths, I would create a configuration for just that path (to drown out noise and eliminate side effects), then test to see what happens. I would also keep an eye on your normal error.log; if the stack gets popped outside of a Grabbit stack frame, it may end up in there instead of the batch logs. Let me know if you have better luck!

jbornemann, Oct 12 '16

I would try just syncing the path in question to see what happens. It could be that after the first run, the session just wasn't done persisting changes. In some configurations, you can find statements in the log like "[session-149] Saving 100 nodes..", etc. If that's not it, I would be more suspicious of the AEM 6.2 side of things (maybe a bad node definition, etc), than delta sync - just because delta sync is pretty straightforward functionality, and pretty well vetted at this point. But if there is a bug in delta sync, we definitely want to know about it! Delta sync is designed to be pessimistic; In other words, if it can't determine that a path is already up-to-date, it syncs it anyway.

Yeah, at this point "deltaContent" doesn't seem to be the cause of any issues. It looks like it works entirely how it's supposed to. Always good to eliminate one possibility! What I've done in the interim is write a script to retrieve every child node of "/content/dam" and generate the config.json dynamically (rough sketch after the config below). So I end up with a config.json that looks more like this:

{ "serverUsername" : "****", "serverPassword" : "****", "serverHost" : "myserver", "serverPort" : "8081", "pathConfigurations" : [ { "path": "/content/dam/a" }, { "path": "/content/dam/b" }, { "path": "/content/dam/c" }, { "path": "/content/dam/d" }, { etc... etc... } ] }

The good news is that, at least for assets in the DAM, it seems to work as intended. Note that I still need to manually disable all workflows; otherwise, "Unresolved conflict" still gets thrown.

I don't think we log out all of the nodes we sync, but I could be wrong - too much log. You lost me a bit with the note about one having no content. Are you saying an asset node sync'd without it's binary data?

If you see what appear to be strange things happening for certain paths, I would create a configuration for just that path (to drown out noise, and eliminate side effects), then test to see what happens. I would also keep an eye on your normal error.log too; If the stack gets popped outside of a Grabbit stack frame it may end up in there instead of the batch logs. Let me know if you get better luck!

As far as I can tell, turning the log level to DEBUG for Grabbit does give output for each node that is synced... But because there are so many, I can't say with certainty. As for the bit of confusion, I might be using too many words to describe the problem. Sorry!

For example, let's say we have an arbitrary JCR such as:

/content/dam/a
/content/dam/b
/content/dam/c
...
/content/dam/z

Specifying a path in a config.json of "content/dam" and running a job works great... up until around "content/dam/m." At this point, the job stops and returns a completed status such as posted previously. It doesn't reach n-z when syncing.

Changing absolutely nothing and simply running the same exact job, it skips a-m (which is good!) and syncs nodes n-z... But yes, without its binary data.

Manually specifying each child node under "content/dam" syncs nodes a-z exactly as intended, so there are no issues of mismatched node definitions between versions, etc etc...

dscirto, Oct 13 '16

Yeah at this point, "deltaSync" doesn't seem to be the cause of any issues. It looks like it works entirely how it's supposed to. Always good to eliminate one possibility! What I've done in the interim is write a script to retrieve every child node of "content/dam" and generate the config.json dynamically. So I end up with a config.json that looks more like this:

{ "serverUsername" : "****", "serverPassword" : "****", "serverHost" : "myserver", "serverPort" : "8081", "pathConfigurations" : [ { "path": "/content/dam/a" }, { "path": "/content/dam/b" }, { "path": "/content/dam/c" }, { "path": "/content/dam/d" }, { etc... etc... } ] }

This method, unfortunately, does not seem to work outside of the DAM. When attempting to sync the rest of the nodes under "/content" with a similar configuration to the one above, I end up with quite a few javax.jcr.InvalidItemStateException errors:

13.10.2016 12:45:06.468 *ERROR* [Thread-184] com.day.cq.wcm.msm.impl.RolloutManagerImpl Exception during process of event {}
com.day.cq.wcm.api.WCMException: javax.jcr.InvalidItemStateException: OakState0001: Unresolved conflicts in /content/somepath/home/blahblah/something/jcr:content
    at com.day.cq.wcm.msm.impl.RolloutManagerImpl.save(RolloutManagerImpl.java:1059)
    at com.day.cq.wcm.msm.impl.RolloutManagerImpl.access$1200(RolloutManagerImpl.java:106)
    at com.day.cq.wcm.msm.impl.RolloutManagerImpl$PageEventProcessor.call(RolloutManagerImpl.java:1198)
    at com.day.cq.wcm.msm.impl.RolloutManagerImpl$PageEventProcessor.call(RolloutManagerImpl.java:1152)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved conflicts in /content/somepath/home/blahblah/something/jcr:content
    at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:237)
    at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:212)
    at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.newRepositoryException(SessionDelegate.java:594)
    at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.save(SessionDelegate.java:461)
    at org.apache.jackrabbit.oak.jcr.session.SessionImpl$8.perform(SessionImpl.java:435)
    at org.apache.jackrabbit.oak.jcr.session.SessionImpl$8.perform(SessionImpl.java:432)
    at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:216)
    at org.apache.jackrabbit.oak.jcr.session.SessionImpl.perform(SessionImpl.java:140)
    at org.apache.jackrabbit.oak.jcr.session.SessionImpl.save(SessionImpl.java:432)
    at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.sling.jcr.base.SessionProxyHandler$SessionProxyInvocationHandler.invoke(SessionProxyHandler.java:113)
    at com.sun.proxy.$Proxy8.save(Unknown Source)
    at com.day.cq.wcm.msm.impl.RolloutManagerImpl.save(RolloutManagerImpl.java:1053)
    ... 7 common frames omitted
Caused by: org.apache.jackrabbit.oak.api.CommitFailedException: OakState0001: Unresolved conflicts in /content/somepath/home/blahblah/something/jcr:content
    at org.apache.jackrabbit.oak.plugins.commit.ConflictValidator.failOnMergeConflict(ConflictValidator.java:84)
    at org.apache.jackrabbit.oak.plugins.commit.ConflictValidator.propertyChanged(ConflictValidator.java:60)
    at org.apache.jackrabbit.oak.spi.commit.CompositeEditor.propertyChanged(CompositeEditor.java:91)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.propertyChanged(EditorDiff.java:93)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareProperties(SegmentNodeState.java:596)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:456)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:148)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:531)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:148)
    at org.apache.jackrabbit.oak.plugins.segment.MapRecord.compare(MapRecord.java:487)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:583)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:148)
    at org.apache.jackrabbit.oak.plugins.segment.MapRecord.compare(MapRecord.java:418)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:583)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:148)
    at org.apache.jackrabbit.oak.plugins.segment.MapRecord.compare(MapRecord.java:418)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:583)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:148)
    at org.apache.jackrabbit.oak.plugins.segment.MapRecord.compare(MapRecord.java:418)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:583)
    at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:52)
    at org.apache.jackrabbit.oak.spi.commit.EditorHook.processCommit(EditorHook.java:54)
    at org.apache.jackrabbit.oak.spi.commit.CompositeHook.processCommit(CompositeHook.java:61)
    at org.apache.jackrabbit.oak.spi.commit.CompositeHook.processCommit(CompositeHook.java:61)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStore$Commit.prepare(SegmentNodeStore.java:405)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStore$Commit.optimisticMerge(SegmentNodeStore.java:428)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStore$Commit.execute(SegmentNodeStore.java:484)
    at org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStore.merge(SegmentNodeStore.java:162)
    at org.apache.jackrabbit.oak.spi.state.ProxyNodeStore.merge(ProxyNodeStore.java:43)
    at org.apache.jackrabbit.oak.core.MutableRoot.commit(MutableRoot.java:247)
    at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.commit(SessionDelegate.java:313)
    at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.save(SessionDelegate.java:459)
    ... 18 common frames omitted

This is the behavior that I was experiencing in my original post, number 2. It looks like over large node sets, Grabbit runs into JCR concurrency issues.

This is the behavior that I'm most concerned about and would like to understand.

dscirto, Oct 13 '16

Did you try just syncing the problem paths in isolation? e.g. "/content/dam/m" from your example?

I know your script for creating subpath configs from a path is probably just for testing, but I don't know... I just wouldn't go there. It's probably a deep, dark rabbit hole. I would find out why it's failing when giving Grabbit a parent-level config, and go from there. It looks like your stacktrace issue occurs within MSM - there might be workflow nastiness associated with saving a bunch of close paths in different sessions, or something.

Also, did you already mention why you are not turning off all workflows within Grabbit?

jbornemann, Oct 14 '16

Yeah, you got it. The script to derive child nodes is just a way of testing different configs. In a perfect world, I'd love to essentially specify "/content" and call it a day. But we aren't quite there yet.

Somewhere in all of the words above, I think I mentioned that syncing the problem paths in isolation works fine. The previous posts are a little bit of a stream of consciousness. Using the test script to generate a config.json listing every child node, "/content/dam/a" through "/content/dam/z", also works... leading me to believe that there aren't any issues with the nodes themselves.

Also, did you already mention why you are not turning off all workflows within Grabbit?

I don't think I explicitly mentioned why I am not turning off workflows within Grabbit, sorry. I actually remove the nodes under "/etc/workflow/launcher/config" entirely, to be synced back from PROD later. A sledgehammer to drive a nail.
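Concretely, the sledgehammer is just the Sling POST servlet's delete operation against the launcher config tree, with the expectation that those nodes come back from PROD in a later sync (host and credentials are placeholders):

# Remove the workflow launcher configurations on the local instance.
# :operation=delete is the standard Sling POST servlet delete operation.
curl -u admin:admin -F ":operation=delete" \
  "http://localhost:4502/etc/workflow/launcher/config"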

It looks like your stacktrace issue occurs within msm - there might be workflow nastiness associated with saving a bunch of close paths in different sessions, or something.

Interesting callout. I completely overlooked com.day.cq.wcm.msm in the stacktrace, but I suspect you're exactly right. It would at least explain the javax.jcr.InvalidItemStateException with "/content/a" syncs and not "/content/dam/a" syncs. I'll dig into that right now, but I bet you're 100% correct.

So let's ignore javax.jcr.InvalidItemStateException when syncing "/content/a-z" for now. I'd like to focus on the odd behavior with the DAM syncs.

dscirto, Oct 18 '16

Small update:

The javax.jcr.InvalidItemStateException when syncing "/content/a-z" was, in fact, a result of MSM. Good catch on that.

So really, the only outstanding thing is the DAM syncing quirk.

dscirto, Oct 18 '16

Hmm, yeah, the empty content issue is curious. I presume you found nothing of interest in the logs?

The tricky part is that you mentioned it only seems to happen if n-z are synced as part of a larger DAM sync. If you sync them individually it works, correct? And by empty content, do you mean that the nt:file/jcr:content jcr:data property is empty on your 6.2 instance? Do all the other properties and supporting nodes seem normal? I think the biggest area of suspicion is the detail of syncing 6.0 -> 6.2 with Grabbit 3. Something funny might be happening in that realm.

I would first try installing the problem dam nodes into 6.2 via the package manager, and see if the package manager has to do any finagling with differing node definitions. Does it look like the dam nodes you are trying to sync are represented the same way between 6.0 and 6.2?

One more thing you could try if you want to get deeper into debugging is examining the stream that comes back from your 6.0 server. For 3.0.5, you can actually do something like this:

GET /grabbit/job?path=/content/dam/problempath

and download the stream directly to your browser. Now, our protocol basically streams binaries as-is - it says "expect this many bytes", and "here you go" - so you should be able to find your raw jcr:data binaries embedded in the stream. You could try to find the raw binary from one of n-z within the stream, and then find a binary from a-m (which you should find). If you can't find the binary from n-z, then we know it's a problem with Grabbit even sending the binary from 6.0. If we find it within the stream, the binary is getting passed to 6.2, but something is getting lost in translation on the client side of things, either with writing the node, a property, etc.
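That check is easy enough from the command line too; something along these lines, assuming the missing asset is a PDF so its magic bytes (%PDF) are easy to spot in the raw stream:

# Pull the raw Grabbit stream for the problem path from the 6.0 server...
curl -u admin:admin -o problempath.stream \
  "http://someserver:8081/grabbit/job?path=/content/dam/problempath"

# ...then look for an embedded PDF binary (-a treats the binary stream as text).
# A hit means the 6.0 side sent the jcr:data; no hit points at the server side.
grep -a -c "%PDF" problempath.stream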

jbornemann, Oct 19 '16

The tricky part is you mentioned it only seems to happen if n-z are synched as part of a larger dam sync. If you sync them individually it works, correct?

Yep, that's what makes it so strange. The issue occurs when syncing "/content/dam," but "/content/dam/n" works as intended -- so does specifying each "/content/dam/a" - "/content/dam/z"

and by empty content, you mean that the nt:file/jcr:content jcr:data property is empty on your 6.2 instance? All the other properties, and supporting nodes seem normal? I think the biggest area of suspect is the detail of syncing 6.0 -> 6.2 with Grabbit 3. Something funny might be happening in that realm.

I think I identified one point of confusion. Many of the nodes I'm referring to as missing their "data" have jcr:primaryType=dam:Asset, not necessarily jcr:primaryType=nt:file... though that doesn't change the issue.

The most accurate way to describe the state of the JCR is that "nt:unstructured" nodes get created for each path that fails to sync. Let me try to obfuscate some screenshots to show you exactly what I mean. I'll follow-up with that shortly.

I would first try installing the problem dam nodes into 6.2 via the package manager, and see if the package manager has to do any finagling with differing node definitions. Does it look like the dam nodes you are trying to sync are represented the same way between 6.0 and 6.2?

One more thing you could try if you want to get deeper into debugging is examining the stream that comes back from your 6.0 server. For 3.0.5, you can actually do something like this :

GET /grabbit/job?path=/content/dam/problempath

and download the stream directly to your browser. Now, our protocol basically streams binaries as-is - it says, "expect this many bytes", and "here you go" - so you should be able to find your raw jcr:data binaries embedded in the stream. You could try to find the raw binary from one of n-z within the stream, and then find a binary from a-m (which you should find). If you can't find the binary from n-z, then we know it's a problem with Grabbit even sending the binary from 6.0. If we find it within the stream, the binary is getting passed to 6.2, but something is getting lost in translation from the client-side of things, either with writing the node, a property, or etc.

It looks like DAM nodes are represented the same way between 6.0 and 6.2. Where I have run into hiccups elsewhere (like the deprecated "cq:Calendar" node type), Grabbit is pretty good about calling out the issue. So that's good at least.

I'll try some of your suggestions for further debugging after I post the aforementioned screenshots. Thanks for sticking with this!

dscirto, Oct 19 '16

Step-by-step screenshots:

Fresh AEM 6.2 DAM (screenshot: fresh_dam)

testconfig.json (screenshot: first_config)

Resulting /content/dam after first run. RED is a single "sling:OrderedFolder" --> refer to as "/content/dam/a" (screenshot: first_run_result)

Successfully synced PDF file from first run, with appropriate node properties. BLUE is "dam:Asset" --> refer to as "/content/dam/a/../one.pdf" (screenshot: first_run_result_pdf)

Same PDF, "/content/dam/a/../one.pdf", existing on the source AEM 6.0 instance (screenshot: first_run_result_pdf_prod)

Resulting "/content/dam" after second run, same config. RED is the same "sling:OrderedFolder"; GREEN are new nodes resulting from the second run (screenshot: second_run_result)

Same snapshot of "/content/dam" existing on the source AEM 6.0 instance. GREEN are the nodes that were falsely not picked up in the first run and synced without node properties/data in the second run (the different node icon is an early tipoff) (screenshot: second_run_result_prod)

GREEN is one parent node from the second run --> refer to as "/content/dam/b". YELLOW is an improperly synced PDF file from the second run; notice the missing node properties --> refer to as "/content/dam/b/../two.pdf" (screenshot: second_run_result_pdf)

GREEN is "/content/dam/b" existing on the source AEM 6.0 instance; YELLOW is "/content/dam/b/../two.pdf" existing on the source AEM 6.0 instance (screenshot: second_run_result_pdf_prod)

I hope you can follow that okay -- unfortunately, I have to keep our stuff obfuscated. So with those screenshots as a visual aid, you can see that the first Grabbit run seems to only grab one "sling:OrderedFolder" and all of the assets underneath it. The second run adds all of the nodes that were "missed" or "skipped" from the first run, but only as "nt:unstructured" nodes.

The "deltaContent" attribute in the testconfig.json DOES in fact play a role in this behavior. If I replace "/content/dam" in testconfig.json with "/content/dam/b/../two.pdf" and make no further changes and run a third time, nothing happens. I expect that because, as far as the two AEMs and Grabbit are concerned at this point, it just recently synced that content.

Now if I remove the "/content/dam/b/../two.pdf" node from the target AEM 6.2 and keep the testconfig.json the same as the third run, "/content/dam/b/../two.pdf" gets replaced in the same exact manner as the second run --> "nt:unstructured"

If I remove the node "/content/dam/b/../two.pdf" once again from the target AEM 6.2 instance AND remove "deltaContent" from testconfig.json, the fifth run replaces "/content/dam/b/../two.pdf" properly.

I'm running out of time today and will have to push your debug suggestions off until tomorrow. Hopefully this sheds a little bit of clarity on the issue.

dscirto, Oct 19 '16

What type of nodes are the nodes in the green block? Other sling:OrderedFolder nodes?

I have a partial theory for how this might happen.

In a given green block, can you find a node down the nt:unstructured tree that is filled in properly? One thing that could cause the plain-jane nt:unstructured nodes is a child node being saved without its parent already present. With delta sync, Grabbit may only be sending child nodes from your AEM 6.0 instance for which it can't determine modified/creation/etc. time; and since your parent nodes aren't present for whatever reason, Jackrabbit is creating nt:unstructured parents. Usually this wouldn't be an issue, because using delta content you would assert that the parents had already been successfully synced at some point, and are still present.

Maybe clear /var/grabbit (essentially Grabbit history) and try again to see if the unstructured nodes show up the same way?
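(Clearing it is just deleting that node on the client, e.g. via CRXDE or the Sling POST servlet; credentials are placeholders:)

# Wipe Grabbit's job history/state on the client instance.
curl -u admin:admin -F ":operation=delete" "http://localhost:4502/var/grabbit"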

jbornemann, Oct 20 '16

Sorry for the delayed reply -- I've been off the past couple of days.

What type of nodes are the nodes in the green block? Other sling:OrderedFolder 's?

The folders in the green blocks are supposed to be "sling:OrderedFolder." In the failed syncs, they end up as "nt:unstructured."

In a given green block can you find a node down the nt:unstructured tree that is filled in properly? One thing that could cause the plain-jane nt:unstructured nodes is if a child node is being saved without it's parent already present. With delta sync, Grabbit may only be sending child nodes from your AEM 6.0 instance for which it can't determine modified/creation/etc time; and since your parent nodes aren't present for whatever reason, Jackrabbit is creating nt:unstructured parents. Usually this wouldn't be an issue, because using delta content you would assert that the parents had already been successfully sync'd at some point, and are still present.

This has to be the core issue, then. Yes, I can drill down and find a child node under an "nt:unstructured" node (green block) that is synced properly. There is no color code to this screenshot -- every node except the one shown is an "nt:unstructured" node that really should be some other node type. It illustrates the behavior you're describing.

(screenshot: xmp)

So that behavior explains why on the second run of a job with "deltaContent", a bunch of "nt:unstructured" nodes are present. It still doesn't explain why those parent nodes get skipped in the first place (regardless of "deltaContent").

It's as if whatever iteration/recursion over the immediate child nodes of "/content/dam" doesn't have accurate knowledge of the existence of those immediate child nodes.
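For the record, I've been spotting the broken parents with a quick QueryBuilder search for nt:unstructured nodes sitting directly under "/content/dam", where the folders should really be "sling:OrderedFolder" (I believe path.flat restricts the search to direct children; host and credentials are placeholders):

# Find direct children of /content/dam that came across as bare nt:unstructured.
curl -u admin:admin \
  "http://localhost:4502/bin/querybuilder.json?path=/content/dam&path.flat=true&type=nt:unstructured&p.limit=100"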

Maybe clear /var/grabbit (essentially Grabbit history) and try again to see if the unstructured nodes show up the same way?

I've tried this at different points (mostly to just clear the noise when checking the status of jobs) and it doesn't seem to change any results.

dscirto, Oct 24 '16

Hey @dscirto any progress on this?

Have you tried eliminating the difference of platform from the equation? Maybe use the package manager to transfer some of these nodes that Grabbit isn't syncing correctly to one 6.2 instance, and then sync to another 6.2 instance. If you see the same behavior, then we at least know it's not an issue with a 6.0 -> 6.2 sync.

jbornemann, Nov 03 '16

I haven't had as much dedicated time the past week for this exact issue, but I've done some other syncing with different environments that makes me lean towards it being an issue with 6.0 => 6.2.

We have a need to sync a 6.1 => 6.2 environment and, in short, have had no issues syncing these instances. This is using Grabbit 7.0.2 for both instances, so there are a lot of different variables, but it's something.

When I have spare time, I am going to try triggering a sync from the 6.0 instance I've been using, but sync it to a fresh local 6.0... to try and narrow down what the issue could be.

One more thing you could try if you want to get deeper into debugging is examining the stream that comes back from your 6.0 server. For 3.0.5, you can actually do something like this :

GET /grabbit/job?path=/content/dam/problempath

This is also high on my list to try.

dscirto, Nov 03 '16

Cool, just a note @dscirto: if you try looking at the binary stream for 7.0.2, the resource path is a bit different. We changed the resource path to be a bit more accurate a couple of versions back:

GET /grabbit/content?path=/content/dam/problempath

jbornemann, Nov 04 '16