
[Feature] Define separate remote for dbt artifact upload

Open pixie79 opened this issue 2 years ago • 3 comments

Hi,

Thanks for this project, it looks great and I am looking to switch over to using it. One note on the docs: to keep downloads fast I will probably zip my project dir. I currently use S3 for my docs, but I would like to use a different bucket from the one used for pulling the Airflow and dbt resources. Is there an override for the operator below to change the upload bucket?

Thanks

    dbt_docs = DbtDocsGenerateOperator(
        task_id="dbt_docs",
        project_dir="s3://my-bucket/dbt/project/key/prefix/",
        profiles_dir="s3://my-bucket/dbt/profiles/key/prefix/",
    )

pixie79 avatar Jan 30 '23 14:01 pixie79

If I do the above with a zip file for the project, it does generate the docs correctly as far as I can tell, but it then attempts to overwrite my zip file on S3. That is not great, as the file would then be overwritten again by my CI/CD process from GitHub.

    [2023-02-21, 15:22:13 UTC] {dbt.py:289} INFO - Pushing dbt project to: s3://XXXX-data-airflow/dbt-project.zip
    [2023-02-21, 15:22:13 UTC] {base.py:88} INFO - Pushing dbt project files to: s3://XXXX-data-airflow/dbt-project.zip
    [2023-02-21, 15:22:13 UTC] {s3.py:243} INFO - Loading file /tmp/airflowtmpmnxiz710/.temp.zip to S3: dbt-project.zip
    [2023-02-21, 15:22:13 UTC] {base_aws.py:130} INFO - No connection ID provided. Fallback on boto3 credential strategy (region_name=None). See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
    [2023-02-21, 15:22:14 UTC] {s3.py:256} WARNING - Failed to load dbt-project.zip: key already exists in S3.
    [2023-02-21, 15:22:16 UTC] {taskinstance.py:1318} INFO - Marking task as SUCCESS. dag_id=dbt_docs_generate, task_id=gl_hourly, execution_date=20230221T151529, start_date=20230221T151531, end_date=20230221T152216
    [2023-02-21, 15:22:16 UTC] {local_task_job.py:208} INFO - Task exited with return code 0

Either way, as you can see, the ZIP push failed but the task was still marked as SUCCESS, which is incorrect: it should fail. Ideally, I need to be able to point the upload at a different location or bucket so that the write can succeed.

pixie79 avatar Feb 21 '23 15:02 pixie79

There is currently no way to override the upload destination: we only support uploading back to the same key the project was downloaded from.

You could, in theory (I haven't tried this), push the documentation artifacts to XCom via do_xcom_push_artifacts and then have a follow-up task pick them up and send them to your other S3 bucket; see the sketch below. But XCom (at least the default backend) wasn't designed to store the heavy dbt documentation artifacts, so this is not ideal.
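Something along these lines might work inside your DAG definition. This is untested: it assumes do_xcom_push_artifacts pushes each listed artifact to XCom under its file name, and that the artifacts arrive as strings or JSON-serializable objects; "my-docs-bucket" and the key prefix are placeholders.

    import json

    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook
    from airflow_dbt_python.operators.dbt import DbtDocsGenerateOperator

    dbt_docs = DbtDocsGenerateOperator(
        task_id="dbt_docs",
        project_dir="s3://my-bucket/dbt/project/key/prefix/",
        profiles_dir="s3://my-bucket/dbt/profiles/key/prefix/",
        # Push the generated docs artifacts to XCom.
        do_xcom_push_artifacts=["manifest.json", "catalog.json"],
    )


    def upload_docs_artifacts(ti):
        """Pull the docs artifacts from XCom and write them to the docs bucket."""
        s3 = S3Hook()  # uses the default AWS connection / boto3 credentials
        for artifact in ("manifest.json", "catalog.json"):
            # Assumption: the operator pushes each artifact under its file name.
            data = ti.xcom_pull(task_ids="dbt_docs", key=artifact)
            s3.load_string(
                string_data=data if isinstance(data, str) else json.dumps(data),
                key=f"dbt/docs/{artifact}",
                bucket_name="my-docs-bucket",  # the separate docs bucket
                replace=True,
            )


    upload_docs = PythonOperator(
        task_id="upload_docs",
        python_callable=upload_docs_artifacts,
    )

    dbt_docs >> upload_docs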

From airflow-dbt-python's perspective, I don't see any reason not to support this; it's a matter of finding the time to implement the feature. I would make it generic enough that we can override the upload destination of all dbt artifacts, not just those generated by dbt docs, perhaps with a new argument like artifact_remote_url.

Alternatively, we could change do_xcom_push_artifacts into a more generic upload_artifacts and have XCom be just one of the options for remote uploads.
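For illustration only, usage of the first idea could end up looking something like this (artifact_remote_url does not exist today; the name and behaviour are just a proposal):

    dbt_docs = DbtDocsGenerateOperator(
        task_id="dbt_docs",
        project_dir="s3://my-bucket/dbt/project/key/prefix/",
        profiles_dir="s3://my-bucket/dbt/profiles/key/prefix/",
        # Proposed, not yet implemented: upload generated artifacts to a
        # separate remote instead of back to project_dir.
        artifact_remote_url="s3://my-docs-bucket/dbt/docs/",
    )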

If you are up to taking a stab at this (or have already done it), I can review the PR. Otherwise, I may have time to do it myself, but I can't promise a timeline.

Thanks for reporting this issue!

tomasfarias avatar Mar 05 '23 15:03 tomasfarias

Thanks for that. I did take a look but can't really see where to do this correctly.

For now I will try the XCom route and hope you are able to find time at some point to add this.

Thank you for your work :)

pixie79 avatar Mar 16 '23 08:03 pixie79