
Explore the possibility of a better experience with the Redshift operator

Open · pankajastro opened this issue 1 year ago · 4 comments

Please describe the feature you'd like to see
Loading to Redshift from S3 seems more complicated than with the traditional Amazon S3ToRedshiftOperator, and we also ran into a header issue with it. https://www.notion.so/astronomerio/ML-Batch-Inference-with-Astro-SDK-Issues-Summary-0703c2fe6abf485da514b45e5db9651f

Relevant link: https://astronomer.slack.com/archives/C02B8SPT93K/p1663942705854959
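
For context, here is a minimal sketch of the two approaches being compared. All bucket, key, schema, table, and connection names are illustrative placeholders, not values from this issue, and the astro-sdk import paths may vary by SDK version:

```python
# Traditional approach: S3ToRedshiftOperator takes separate Redshift and S3
# connections, and copy_options can handle the CSV header row directly.
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

transfer = S3ToRedshiftOperator(
    task_id="s3_to_redshift",
    schema="public",
    table="my_table",                        # placeholder table
    s3_bucket="my-bucket",                   # placeholder bucket
    s3_key="data/file.csv",
    redshift_conn_id="redshift_default",     # Redshift connection
    aws_conn_id="aws_default",               # S3/AWS connection
    copy_options=["CSV", "IGNOREHEADER 1"],  # skip the header row
)

# astro-sdk approach: a single load_file call; the output table carries only
# the Redshift connection, which is where the friction described above shows up.
from astro import sql as aql
from astro.files import File
from astro.sql.table import Table  # import path may differ across SDK versions

load = aql.load_file(
    input_file=File(path="s3://my-bucket/data/file.csv", conn_id="aws_default"),
    output_table=Table(name="my_table", conn_id="redshift_default"),
)
```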

Describe the solution you'd like
A clear and concise description of what you want to happen, if possible with code examples.

Are there any alternatives to this feature?
Is there another way we could solve this problem or enable this use case?

Additional context
Add any other context about the feature request here.

Acceptance Criteria

  • [ ] All checks and tests in the CI should pass
  • [ ] Unit tests (90% code coverage or more, once available)
  • [ ] Integration tests (if the feature relates to a new database or external service)
  • [ ] Example DAG
  • [ ] Docstrings in reStructuredText for each method, class, function, and module-level attribute (including an example DAG showing how it should be used)
  • [ ] Exception handling in case of errors
  • [ ] Logging (are we exposing useful information to the user? e.g. source and destination)
  • [ ] Improved documentation (README, Sphinx, and any other relevant docs)
  • [ ] How-to guide for the feature (with an example)

pankajastro · Sep 25 '22 10:09

@pankajkoti Any update on this? Can you spend some time tomorrow or on Monday to see if you can take care of this, please?

kaxil · Oct 06 '22 21:10

Hi @kaxil, yes, I have taken a look at it. The Amazon S3ToRedshiftOperator takes two connections (https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/transfers/s3_to_redshift.py#L79): one for the Redshift hook and another for the S3 hook. In our base class implementation, we only have one connection, i.e. for Redshift. The S3Hook has a get_credentials method which returns an IAM-based key pair and token, so we do not need an IAM role to be passed explicitly; see the condensed sketch of that flow below. I was thinking about how we should accept another connection ID (for the S3Hook) here: https://github.com/astronomer/astro-sdk/blob/main/python-sdk/src/astro/databases/aws/redshift.py#L73, so that we can use something like this: https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/transfers/s3_to_redshift.py#L131.
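
A condensed sketch of that credentials flow, paraphrased from the linked operator source rather than copied verbatim; the bucket, table, and connection names are illustrative:

```python
# How S3ToRedshiftOperator derives COPY credentials from the S3 connection,
# so no explicit IAM role is required.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.utils.redshift import build_credentials_block

s3_hook = S3Hook(aws_conn_id="aws_default")  # the second connection
credentials = s3_hook.get_credentials()      # IAM key pair (+ session token)
credentials_block = build_credentials_block(credentials)
# -> "aws_access_key_id=...;aws_secret_access_key=...[;token=...]"

copy_query = f"""
    COPY public.my_table
    FROM 's3://my-bucket/data/file.csv'
    credentials '{credentials_block}'
    CSV IGNOREHEADER 1;
"""
```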

Our base class implementation just has conn_id. Should I add another param, s3_conn_id, here: https://github.com/astronomer/astro-sdk/blob/main/python-sdk/src/astro/databases/aws/redshift.py#L73? A hypothetical sketch follows.
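
A hypothetical sketch of what that could look like; the existing signature is paraphrased from the linked file, and s3_conn_id is the proposed addition under discussion, not merged code:

```python
from typing import Optional

from astro.databases.base import BaseDatabase

DEFAULT_CONN_ID = "redshift_default"  # illustrative default

class RedshiftDatabase(BaseDatabase):
    def __init__(self, conn_id: str = DEFAULT_CONN_ID, s3_conn_id: Optional[str] = None):
        super().__init__(conn_id)
        # If set, an S3Hook built from s3_conn_id could supply the IAM
        # credentials for COPY, mirroring S3ToRedshiftOperator's aws_conn_id.
        self.s3_conn_id = s3_conn_id
```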

pankajkoti · Oct 07 '22 06:10