NIFI-13430 Added CopyS3Object and ExistsS3Object
Summary
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
- [ ] Apache NiFi Jira issue created
Pull Request Tracking
- [ ] Pull Request title starts with Apache NiFi Jira issue number, such as `NIFI-00000`
- [ ] Pull Request commit message starts with Apache NiFi Jira issue number, such as `NIFI-00000`
Pull Request Formatting
- [ ] Pull Request based on current revision of the `main` branch
- [ ] Pull Request refers to a feature branch with one commit containing changes
Verification
Please indicate the verification steps performed prior to pull request creation.
Build
- [ ] Build completed using `mvn clean install -P contrib-check`
- [ ] JDK 21
Licensing
- [ ] New dependencies are compatible with the Apache License 2.0 according to the License Policy
- [ ] New dependencies are documented in applicable `LICENSE` and `NOTICE` files
Documentation
- [ ] Documentation formatting appears as expected in rendered files
@exceptionfactory Alright. I'll open a PR for this functionality on support/nifi-1.x and then start refactoring this PR to use AWS 2.X APIs. Does that work for you?
I recommend refactoring first, as the AWS SDK 2 libraries are also available on the support branch. Working through the changes on the main branch first should limit the amount of work necessary to consider back-porting.
Ok. Will do. I will probably have to massively expand the scope of this PR but it is what it is...
@exceptionfactory I did some experimentation with this today, and there is going to be a lot of refactoring work in both the abstract processors module and the aws processors module to get the S3 stuff migrated. I'd rather add these with the V1 API since they're thin wrappers over that API and then do a bigger ticket afterward to try to get everything on the V2 API.
I see your point, but I'd rather get this functionality in now so it's available because we have some use cases that really need it. I don't think it'll add much to that refactoring task.
Thanks for the follow up @MikeThomsen.
Incorporating new Processors and refactoring in the same pull request for S3 doesn't sound ideal, and I agree that the scope of these Processors is fairly narrow in terms of S3 API changes. With that in mind, considering these Processors as they stand seems like a reasonable way forward.
To that end, there are a few things I recommend evaluating. The Copy Processor seems straightforward, and I can take a closer look at potential changes there. I would note that new properties should just use a human-readable name() instead of a different displayName() going forward.
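For illustration, a minimal sketch of the recommended property style using NiFi's PropertyDescriptor.Builder; the property name and description here are illustrative, not the PR's actual definitions:

```java
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.processor.util.StandardValidators;

// Older convention: machine-readable name() plus a separate displayName(), e.g.
//   .name("copy-s3-destination-bucket").displayName("Destination Bucket")
// Recommended convention for new properties: a single human-readable name()
public static final PropertyDescriptor DESTINATION_BUCKET = new PropertyDescriptor.Builder()
        .name("Destination Bucket")
        .description("The S3 bucket that the object should be copied into")
        .required(true)
        .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
        .build();
```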
On the Exists Processor, I think it would be better to have it function as a GetS3ObjectMetadata Processor. The doesObjectExist() method requests the metadata, but ignores the response. I think having the metadata would present other potential uses beyond just whether the object exists, while still supporting that particular use case.
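As a sketch of why the metadata-oriented approach is more broadly useful (AWS SDK for Java v1 calls; the helper name and attribute keys are hypothetical): a single getObjectMetadata() HEAD request answers the existence question and also exposes fields worth surfacing.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import java.util.HashMap;
import java.util.Map;

// One HEAD request returns the metadata; existence falls out of the same call
Map<String, String> describeObject(final AmazonS3 client, final String bucket, final String key) {
    final ObjectMetadata metadata = client.getObjectMetadata(bucket, key);
    final Map<String, String> summary = new HashMap<>();
    summary.put("s3.contentLength", String.valueOf(metadata.getContentLength()));
    summary.put("s3.etag", metadata.getETag());
    // User-defined x-amz-meta-* headers supplied when the object was written
    metadata.getUserMetadata().forEach((k, v) -> summary.put("s3.user." + k, v));
    return summary;
}
```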
strong +1 on David's comments regarding GetS3ObjectMetadata instead of an exists call.
I'll take a look into that @joewitt @exceptionfactory, but it might be right to have 3 processors instead. I recall something about how doesObjectExist() uses a HEAD request, which is faster for S3 than fetching the metadata when your only goal is to ask whether the object exists. I'll look into it and adjust accordingly.
@MikeThomsen The doesObjectExist() method calls getObjectMetadata(), so there should be no difference in performance. That's the reason I suggested going that direction, thus I would not recommend a separate processor for checking object existence.
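For reference, a rough equivalent of what doesObjectExist() does internally in the v1 SDK, which is why a dedicated existence call saves nothing (sketch; exception handling simplified):

```java
import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.s3.AmazonS3;

// Existence is just getObjectMetadata() with the response thrown away
boolean objectExists(final AmazonS3 client, final String bucket, final String key) {
    try {
        client.getObjectMetadata(bucket, key); // HEAD request either way
        return true;
    } catch (final AmazonServiceException e) {
        if (e.getStatusCode() == 404) {
            return false;
        }
        throw e;
    }
}
```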
@jrsteinebrey @exceptionfactory we should be GTG now on the changes.
Thanks for the updates @MikeThomsen, I will take a closer look soon.
@exceptionfactory any idea when you can get to this?
Thanks @exceptionfactory. I'll try to get a fast turnaround on this.
@exceptionfactory @jrsteinebrey Changes should be good now. I changed the "attribute" configuration option to an "attribute prefix" option. I don't think this version of the AWS Java SDK supports retrieving the ACL from the object metadata, but both raw and user-provided metadata are attached under the user-specified prefix.
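A minimal sketch of the prefix behavior described above, assuming the metadata is flattened into FlowFile attributes (the helper name is illustrative, not the PR's actual code):

```java
import com.amazonaws.services.s3.model.ObjectMetadata;
import java.util.HashMap;
import java.util.Map;

// Flatten raw and user metadata into attributes under the configured prefix
Map<String, String> toAttributes(final ObjectMetadata metadata, final String prefix) {
    final Map<String, String> attributes = new HashMap<>();
    metadata.getRawMetadata().forEach((k, v) -> attributes.put(prefix + k, String.valueOf(v)));
    metadata.getUserMetadata().forEach((k, v) -> attributes.put(prefix + k, v));
    return attributes;
}
```

The resulting map would then be applied to the FlowFile with session.putAllAttributes(flowFile, attributes).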
@exceptionfactory @jrsteinebrey should be good now.
@MikeThomsen Thanks again for these new processors. I am fine with merging them once the workflow actions succeed (they are still running now) and @exceptionfactory (or any committer or higher) approves the merge.
@exceptionfactory I played around a bit with that and liked the idea. Should be GTG now.
@exceptionfactory should be good now.
@exceptionfactory do you think you'll have time to look at this today?
Thanks for the updates @MikeThomsen, I should have time to review again within the next few days. At a quick glance, the latest code changes look good, so it looks like it should be close to completion following some runtime testing.
Thanks @exceptionfactory. I'll try to get this completed tonight.
@exceptionfactory done and build started.