nifi icon indicating copy to clipboard operation
nifi copied to clipboard

NIFI-12918 Fix Stateless NullPointerException on versioned sub-process groups - main branch

Open slambrose opened this issue 1 year ago • 8 comments

Summary

NIFI-12918

Here is the PR for the 1.x support branch: https://github.com/apache/nifi/pull/8533

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • [X] Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • [X] Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • [X] Pull Request based on current revision of the main branch
  • [X] Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • [X] Build completed using mvn clean install -P contrib-check
  • [X] JDK 21

Licensing

  • [X] New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • [X] New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • [X] Documentation formatting appears as expected in rendered files

slambrose avatar Mar 19 '24 17:03 slambrose

@slambrose Can you please determine whether the failures in the builds correspond to your changes?

dan-s1 avatar Mar 20 '24 17:03 dan-s1

Does the current main branch even pass the integration tests? When I build locally with the tests, they fail on the main branch.

slambrose avatar Mar 21 '24 14:03 slambrose

@slambrose Yes. https://github.com/apache/nifi/actions. When I build locally they work fine and that link suggests they succeed quite often including recently. They still aren't 100% bulletproof and as such do have failures at times and tremendous effort has gone into improving them. But I interpret David's note to mean he looked at them and found it is likely related to this change.

Fundamentally if your question "Are these system/integration tests important to verify for changes like this?" The answer is yes they're very important as a safeguard to ensure changes at the framework level have some sanity checking.

joewitt avatar Mar 21 '24 14:03 joewitt

Just to highlight the most recent failures, the following tests and methods are visible in the system test action logs.

Error:    ClusteredRegistryClientIT>RegistryClientIT.testChangeConnectionDestinationRemoveOldAndMoveGroup:163 » NiFiClient Error starting version control: Failed to update Version Control Information for Process Group with ID 61655b6e-018e-1000-ffff-ffffb050df1b.
Error:    ClusteredRegistryClientIT>RegistryClientIT.testChangeVersionOnParentThatCascadesToChild:92 » NiFiClient Error starting version control: Failed to update Version Control Information for Process Group with ID 616560a1-018e-1000-0000-0000153b8e9e.
Error:    ClusteredRegistryClientIT>RegistryClientIT.testChangeVersionWithPortMoveBetweenGroups:262 » NiFiClient Error creating process group: An unexpected error has occurred. Please check the logs for additional details.
Error:    ClusteredRegistryClientIT>RegistryClientIT.testControllerServiceUpdateWhileRunning:218 » NiFiClient Error starting version control: Failed to update Version Control Information for Process Group with ID 616569b2-018e-1000-0000-0000519c22d0.
Error:    ClusteredRegistryClientIT>RegistryClientIT.testCopyPasteProcessGroupDoesNotDuplicateVersionedComponentId:382 » NiFiClient Error starting version control: Failed to update Version Control Information for Process Group with ID 616563fc-018e-1000-ffff-fffff320df4a.
Error:    ClusteredRegistryClientIT>RegistryClientIT.testCopyPasteProcessGroupUnderVersionControlMaintainsVersionedComponentId:432 » NiFiClient Error starting version control: Failed to update Version Control Information for Process Group with ID 616555c6-018e-1000-ffff-fffff72dc814.
Error:    ClusteredRegistryClientIT>RegistryClientIT.testRollbackOnFailure:309 » NiFiClient Error creating process group: An unexpected error has occurred. Please check the logs for additional details.
Error:    ClusteredRegistryClientIT>RegistryClientIT.testStartVersionControlThenImport:332 » NiFiClient Error starting version control: Failed to update Version Control Information for Process Group with ID 61656d21-018e-1000-ffff-ffffa5788503.
Error:    ClusteredRegistryClientIT>RegistryClientIT.testStartVersionControlThenModifyAndRevert:355 » NiFiClient Error starting version control: Failed to update Version Control Information for Process Group with ID 616566d5-018e-1000-ffff-ffffc48d8b9f.
Error:    RegistryClientIT.testChangeConnectionDestinationRemoveOldAndMoveGroup:163 » NiFiClient Error starting version control: An unexpected error has occurred. Please check the logs for additional details.
Error:    RegistryClientIT.testChangeVersionOnParentThatCascadesToChild:92 » NiFiClient Error starting version control: An unexpected error has occurred. Please check the logs for additional details.
Error:    RegistryClientIT.testChangeVersionWithPortMoveBetweenGroups:262 » NiFiClient Error creating process group: An unexpected error has occurred. Please check the logs for additional details.
Error:    RegistryClientIT.testControllerServiceUpdateWhileRunning:218 » NiFiClient Error starting version control: An unexpected error has occurred. Please check the logs for additional details.
Error:    RegistryClientIT.testCopyPasteProcessGroupDoesNotDuplicateVersionedComponentId:382 » NiFiClient Error starting version control: An unexpected error has occurred. Please check the logs for additional details.
Error:    RegistryClientIT.testCopyPasteProcessGroupUnderVersionControlMaintainsVersionedComponentId:432 » NiFiClient Error starting version control: An unexpected error has occurred. Please check the logs for additional details.
Error:    RegistryClientIT.testRollbackOnFailure:309 » NiFiClient Error creating process group: An unexpected error has occurred. Please check the logs for additional details.
Error:    RegistryClientIT.testStartVersionControlThenImport:332 » NiFiClient Error starting version control: An unexpected error has occurred. Please check the logs for additional details.
Error:    RegistryClientIT.testStartVersionControlThenModifyAndRevert:355 » NiFiClient Error starting version control: An unexpected error has occurred. Please check the logs for additional details.
Error:    StatelessBasicsIT.testChangeFlowVersion:741 » NiFiClient Error starting version control: An unexpected error has occurred. Please check the logs for additional details.

exceptionfactory avatar Mar 21 '24 15:03 exceptionfactory

Thanks, I see it now. Looks like there was an integration test that was added in 2.0 that isn't good. The test that's failing looks like it should have picked up the bug that I'm attempting to fix, but it does not. I'll add a change to the test class so that it doesn't break anymore.

slambrose avatar Mar 21 '24 15:03 slambrose

So, here is my conclusion on this. Unfortunately, I'm not sure I have the cycles to do the full re-write that this code probably requires. The main and 1.x support branches currently do not work to process stateless flows that have sub-versioned flows. It's broken in current state. My two small changes fix both the main and 1.x support branches in terms of processing the flows (tested and confirmed); however, there are 9 integration tests that appear to be trying to test this exact use case. Clearly, these these 9 tests were bad to begin with because they don't catch the bug at all. I'm not sure what actually changed in the code base such that these 9 tests fail in the main branch but not the 1.x support branch. I see many places where this concept of a "registryId" is not fully programmed out, so making a require non-null on it does not work. I'm proposing an acceptance of these merge requests with the broken tests commented out to fix processing in production, with a continued ticket to investigate coding this out completely and fixing the integration tests at a later time. Otherwise, the product remains broken indefinitely.

slambrose avatar Mar 25 '24 11:03 slambrose

Okay, I'm trying one more solution. What I've found is that the main branch has new code where a "registryId" is implemented, but it is always null right now because the registry-api does not currently supply a "registryId" property value. The integration tests hard-code the creation of a registry UUID to simulate testing the bug that currently exists in 1.x and 2.0 (main) branches. This will put in a UUID placeholder and allow the integration tests to pass, while also fixing the current broken stateless flows in 1.x and 2.0.

slambrose avatar Mar 26 '24 18:03 slambrose

Okay, so after some back and forth here is my final conclusion to this bug:

I've submitted two PRs https://github.com/apache/nifi/pull/8572/checks <-- for support/1.x branch https://github.com/apache/nifi/pull/8536/checks <-- for main branch

Here is what's happening. If you run a versioned stateless flow that has sub-versioned progress group, there is a NullPointerException thrown on the build method of StandardVersionControlInformation where there is a non-null requirement on "registryId".

Code has changed in main versus support/1.x, but the bug exists on both. What I've found is the JerseyClient calls that are made to map the response from the registry-api back to the higher level VersionedProcessGroup -> VersionedFlowCoordinates both return JSON where the VersionedFlowCoordinates > registryId is always null. Now in the main branch, stateless uses some of these new "synchronizer" classes, so the best place to insert a solution is to add an else on the null check where the code runs the "determineRegistryId" method and set the value to "1" for versioned flows that do not have a registryId associated to them. Doing this passes all of the integration tests and code checks and fixes the bug. Since this class does not exist in the 1.x support branch, my best solution is to just comment out the null check (this breaks integration tests in main, but not in 1.x).

I continued to look further at this idea of a "registryId" within the registry api code. The GET method on buckets/flow/version returns the same class that the stateless code uses to map into on the JerseyClient call. I then looked at the api code to POST new versioned flows, and it also expect the json as a parameter to match the same class. Well, since there is no "non-null" requirement on the registryId, versioned flows are stored in the database (or whichever storage adapter used) with this property as null. I did not check to see if the latest NiFi Registry UI is now setting this property, but making it non-null would break any older version of registry and not be backwards compatible. Therefore, the only solution at this point to fix stateless NiFi in its current state is to just set registryId to "1" if it is null. I believe this code is still probably prototype being worked and evolving, so this will be a temporary fix until those mandates are in place on a concept of "registryId".

slambrose avatar Mar 27 '24 13:03 slambrose

Thanks for taking another look :)

So, the storageLocation change is still needed to fix the bug and process the flow in stateless. Otherwise, when it loops down to pull the sub-versioned process flow, it'll append the /{bucketId}/{flowId} parameters twice and make a bad call to the registry-api. The best way I came up to fix that was just to make sure to start with the base registry url and append the storage location parameters to always construct a valid registryUrl. I noticed the 1.x support branch was originally setting the registryUrl to coordinates.getRegistryUrl(), which was changed to coordinates.getStorageLocation() in main. This fix works for the 1.x branch as well.

Do you have another value other than "1" you'd prefer me to set it to? FYI - it has to be a static value in order for all of the current tests to work.

I can work on adding some unit tests for both of these fixes. I'll add a comment here when I commit them and are ready for review.

slambrose avatar Mar 27 '24 18:03 slambrose

Ready for re-review.

slambrose avatar Mar 28 '24 17:03 slambrose

Thanks for working on the unit tests @slambrose. The test for the synchronizer class looks like it exercises the method changes, but the TestRegistryUtil appears to be running against a mock of RegistryUtil, as opposed to the actual class, so it doesn't exercise the real method.

Ready for re-review.

slambrose avatar Apr 03 '24 11:04 slambrose

@exceptionfactory Okay, this should be good now.

slambrose avatar Apr 04 '24 16:04 slambrose

Thanks for the latest updates @slambrose, this is looking closer to completion.

Changing the InMemoryFlowRegistry to return true makes sense, as it is only used in Stateless operation, and it is the only Flow Registry Client available when running. It would be helpful to add a basic unit test class for InMemoryFlowRegistry, just to track that expected behavior.

The other consideration is the base Registry URL. Although the current getBaseRegistryUrl() method works when a Registry server is accessible directly, if there is a reverse proxy in front that has an additional context path, rebuilding the URL with just the host and port presents a problem. For example:

https://nifi-registry.local/context-path/nifi-registry-api

The method would return the following:

https://nifi-registry.local

The Registry Client would then attempt to append /nifi-registry-api as the context path from the root, which would result in failures.

In light of the fact that the storageLocation includes the full path, it seems like a regular expression match would address all scenarios. Something along the lines of the following pattern:

^(https?://.+?/nifi-registry-api).*$

With that pattern, the first capturing group can then be used at the registry URL.

Additional Unit tests will be address in future work for InMemoryFlowRegistry class. Regex change has been added. Ready for re-review @exceptionfactory

slambrose avatar Apr 05 '24 18:04 slambrose