nifi icon indicating copy to clipboard operation
nifi copied to clipboard

NIFI-13430 Added CopyS3Object and ExistsS3Object

Open MikeThomsen opened this issue 1 year ago • 8 comments

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • [ ] Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • [ ] Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • [ ] Pull Request based on current revision of the main branch
  • [ ] Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • [ ] Build completed using mvn clean install -P contrib-check
    • [ ] JDK 21

Licensing

  • [ ] New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • [ ] New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • [ ] Documentation formatting appears as expected in rendered files

MikeThomsen avatar Jun 21 '24 01:06 MikeThomsen

@exceptionfactory Alright. I'll open a PR for this functionality on support/nifi-1.x and then start refactoring this PR to use AWS 2.X APIs. Does that work for you?

MikeThomsen avatar Jun 21 '24 09:06 MikeThomsen

@exceptionfactory Alright. I'll open a PR for this functionality on support/nifi-1.x and then start refactoring this PR to use AWS 2.X APIs. Does that work for you?

I recommend refactoring first, as the AWS SDK 2 libraries are also available on the support branch. Working through the changes on the main branch first should limit the amount of work necessary to consider back-porting.

exceptionfactory avatar Jun 21 '24 12:06 exceptionfactory

Ok. Will do. I will probably have to massively expand the scope of this PR but it is what it is...

MikeThomsen avatar Jun 21 '24 13:06 MikeThomsen

@exceptionfactory I did some experimentation with this today, and there is going to be a lot of refactoring work in both the abstract processors module and the aws processors module to get the S3 stuff migrated. I'd rather add these with the V1 API since they're thin wrappers over that API and then do a bigger ticket afterward to try to get everything on the V2 API.

I see your point, but I'd rather get this functionality in now so it's available because we have some use cases that really need it. I don't think it'll add much to that refactoring task.

MikeThomsen avatar Jun 23 '24 20:06 MikeThomsen

Thanks for the follow up @MikeThomsen.

Incorporating new Processors and refactoring in the same pull request for S3 doesn't sound ideal, and I agree that the scope of these Processors is fairly narrow in terms of S3 API changes. With that in mind, considering these Processors as they stand seems like a reasonable way forward.

To that end, there are a few things I recommend evaluating. The Copy Processor seems straightforward, and I can take a closer look at potential changes there. I would note that new properties should just use a human-readable name() instead of a different displayName() going forward.

On the Exists Processor, I think it would be better to have it function as a GetS3ObjectMetadata Processor. The doesObjectExist() method requests the metadata, but ignores the response. I think having the metadata would present other potential uses beyond just whether the object exists, while still supporting that particular use case.

exceptionfactory avatar Jun 24 '24 20:06 exceptionfactory

strong +1 on David's comments regarding GetS3ObjectMetadata instead of an exists call.

joewitt avatar Jun 24 '24 20:06 joewitt

I'll take a look into that @joewitt @exceptionfactory but it might be right to have 3 processors instead. I recall something about how doesObjectExists uses a HEAD which is faster for S3 than trying to fetch the metadata when your only goal is to ask whether the object exists. I'll look into it and adjust accordingly.

MikeThomsen avatar Jun 25 '24 16:06 MikeThomsen

I'll take a look into that @joewitt @exceptionfactory but it might be right to have 3 processors instead. I recall something about how doesObjectExists uses a HEAD which is faster for S3 than trying to fetch the metadata when your only goal is to ask whether the object exists. I'll look into it and adjust accordingly.

@MikeThomsen The doesObjectExist() method calls getObjectMetadata(), so there should be no difference in performance. That's the reason I suggested going that direction, thus I would not recommend a separate processor for checking object existence.

exceptionfactory avatar Jun 25 '24 16:06 exceptionfactory

@jrsteinebrey @exceptionfactory we should be GTG now on the changes.

MikeThomsen avatar Jul 10 '24 14:07 MikeThomsen

@jrsteinebrey @exceptionfactory we should be GTG now on the changes.

Thanks for the updates @MikeThomsen, I will take a closer look soon.

exceptionfactory avatar Jul 10 '24 15:07 exceptionfactory

@exceptionfactory any idea when you can get to this?

MikeThomsen avatar Jul 16 '24 15:07 MikeThomsen

Thanks @exceptionfactory. I'll try to get a fast turn around on this

MikeThomsen avatar Jul 16 '24 20:07 MikeThomsen

@exceptionfactory @jrsteinebrey Changes should be good now. I changed the "attribute" configuration option to be an "attribute prefix" option. I don't think this version of the AWS Java SDK supports retrieving the ACL from the object metadata, but both raw and user-provided metadata is attached at the user-specified prefix.

MikeThomsen avatar Aug 08 '24 21:08 MikeThomsen

@exceptionfactory @jrsteinebrey should be good now.

MikeThomsen avatar Aug 09 '24 13:08 MikeThomsen

@MikeThomsen Thanks again for these new processors. I am fine with them being merged if the workflow actions succeed (they are still running now) and if @exceptionfactory (or any committer or higher) approves of it being merged.

jrsteinebrey avatar Aug 09 '24 14:08 jrsteinebrey

@exceptionfactory I played around a bit with that and liked the idea. Should be GTG now.

MikeThomsen avatar Aug 09 '24 19:08 MikeThomsen

@exceptionfactory should be good now.

MikeThomsen avatar Aug 10 '24 01:08 MikeThomsen

@exceptionfactory do you think you'll have time to look at this today?

MikeThomsen avatar Aug 12 '24 15:08 MikeThomsen

Thanks for the updates @MikeThomsen, I should have time to review again within the next few days. At a quick glance, the latest code changes look good, so it looks like it should be close to completion following some runtime testing.

exceptionfactory avatar Aug 12 '24 16:08 exceptionfactory

Thanks @exceptionfactory. I'll try to get this completed tonight.

MikeThomsen avatar Aug 13 '24 22:08 MikeThomsen

@exceptionfactory done and build started.

MikeThomsen avatar Aug 14 '24 18:08 MikeThomsen