AWS GlueCatalogHook doesn't support custom CatalogId
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon
- version 8.28.0
Apache Airflow version
2.10.1
Operating System
MWAA
Deployment
Amazon (AWS) MWAA
Deployment details
Vanilla Deployment
What happened
The current GlueCatalogHook doesn't pass the CatalogId property during boto3 calls as seen from here:
What you think should happen instead
There should be a way to pass the CatalogId, since some users need to target a catalog other than the default one for their AWS account.
- This happened to my use case at work.
How to reproduce
Try to target a Glue database and table whose CatalogId is not the default AWS account ID. All operations will fail.
Anything else
As a workaround, I copied the implementation of the actual GlueCatalogHook into an ExtendedGlueCatalogHook that adds the CatalogId to the calls, and changed our sensors to use it. Example:
def get_partitions(
    self,
    catalog_id: str,
    database_name: str,
    table_name: str,
    expression: str = "",
    page_size: int | None = None,
    max_items: int | None = None,
) -> set[tuple]:
    ...
    response = paginator.paginate(
        CatalogId=catalog_id,  # <== This should be added as an optional parameter
        DatabaseName=database_name,
        TableName=table_name,
        Expression=expression,
        PaginationConfig=config,
    )
    partitions = set()
    for page in response:
        for partition in page["Partitions"]:
            partitions.add(tuple(partition["Values"]))
    return partitions
...
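A rough sketch of how the optional parameter could look, so that existing callers keep the default catalog. This is not the provider's actual code: the helper names `build_paginate_kwargs` and `collect_partitions` are hypothetical, and the pagination config shape (`PageSize`, `MaxItems`) is assumed to match what the hook already passes. The key point is that `CatalogId` is only sent when a value is given.

```python
from __future__ import annotations

from typing import Any, Iterable, Protocol


class Paginator(Protocol):
    """Anything with a boto3-style paginate(**kwargs) method."""

    def paginate(self, **kwargs: Any) -> Iterable[dict[str, Any]]: ...


def build_paginate_kwargs(
    database_name: str,
    table_name: str,
    expression: str = "",
    catalog_id: str | None = None,  # hypothetical optional parameter
    page_size: int | None = None,
    max_items: int | None = None,
) -> dict[str, Any]:
    """Build the keyword arguments for the get_partitions paginator."""
    kwargs: dict[str, Any] = {
        "DatabaseName": database_name,
        "TableName": table_name,
        "Expression": expression,
        "PaginationConfig": {"PageSize": page_size, "MaxItems": max_items},
    }
    # Only include CatalogId when the caller supplied one, so the default
    # (the caller's own account catalog) still applies otherwise.
    if catalog_id is not None:
        kwargs["CatalogId"] = catalog_id
    return kwargs


def collect_partitions(paginator: Paginator, **paginate_kwargs: Any) -> set[tuple]:
    """Collect partition values from every page into a set of tuples."""
    partitions: set[tuple] = set()
    for page in paginator.paginate(**paginate_kwargs):
        for partition in page["Partitions"]:
            partitions.add(tuple(partition["Values"]))
    return partitions
```

With boto3, the paginator would come from something like `glue_client.get_paginator("get_partitions")`, and the call would be `collect_partitions(paginator, **build_paginate_kwargs("db", "tbl", catalog_id="123456789012"))`, where the database, table, and account ID are placeholders.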
If anyone from the AWS team is going to work on this one, I'm also part of Amazon, so you can reach me (keds@) and I can show you what we did.
Thanks!
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
Go ahead :)
I can have a look at this some time next week if @keeed doesn't?
I asked Devin to take a look into this issue, the solution it came up with looks good to me. The old unit tests + new unit tests Devin wrote passed as well. Created a PR with the changes, would be interested in a review: https://github.com/apache/airflow/pull/44800
Link to Devin run: https://app.devin.ai/sessions/f6e5706fdebf47cb8cafcb44e8dd3ccb
PRs should come from a real person's GitHub account, not from bots.
cc: @potiuk
Actually it can. https://www.apache.org/legal/generative-tooling.html:
Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
- The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
- At least one of the following conditions is met:
- The output is not copyrightable subject matter (and would not be even if produced by a human).
- No third party materials are included in the output.
- Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
- A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results.
@olsenbudanur -> can you confirm (and briefly explain how) those conditions are met?
Ah okay :)
@potiuk looking into these conditions right now, will send an update soon
I've contacted the Devin (Cognition) team, and they confirmed that they reviewed the Apache terms & open-source terms. They say everything looks fine (can tag someone from their team if needed)
1- The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition
According to Devin's terms, all outputs are fully owned by the user. No copyright and no third party materials.
2- At least one of the following conditions is met: the output is not copyrightable subject matter. Also, no third-party materials are used.
3- I am certain that this is not applicable.
@potiuk is there anything I'm missing?
According to Devin's terms, all outputs are fully owned by the user. No copyright and no third party materials.
This is wrong. For example, if they train their model on GPL code, the GPL puts restrictions on redistribution of that code. It's not the "ownership" of the code that matters, it's the licence restrictions that are put on the code.
Oh I see. I'll try to pull someone from the Cognition team in this thread, they should have a better understanding of this than I do
BTW. If Devin wants to check if they follow the licence, they can write a question and explain what they do and ask for clarification via mechanisms described at https://www.apache.org/legal/#communications
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There has been several Airflow releases since last activity on this issue. Kindly asking to recheck the report against latest Airflow version and let us know if the issue is reproducible. The issue will be closed in next 30 days if no further activity occurs from the issue author.