
augment and streamline export functionality


Resolves #338

Explanation

  • Created a local exports folder within Airflow (and updated .gitignore and docker_compose.yml accordingly) so that data marts are exported locally in addition to S3 (if a destination bucket URI is specified).
  • Added the s3fs library to requirements.txt to allow for more streamlined writing of files to S3 (described below).
  • Leveraged the built-in ability of pandas (when s3fs is installed) to write files (CSV, Parquet, etc.) directly from a DataFrame to an S3 bucket, with no need to write to a local file and then copy it to the bucket (a minimal sketch follows the example .env file below).
  • The export_marts DAG writes to .parquet format by default (I left commented-out lines for writing to CSV for easy editing, if desired).
  • Leveraged the default behavior of pandas/s3fs of reading AWS credentials from environment variables (no need to explicitly pull the keys from the environment and then pass them to boto/pandas/etc.).
    • Amended the docker_compose.yml file to remove unnecessary references to AWS environment variables. Simply naming them appropriately in the .env file (example below) is all that's required.
# example .env file
AIRFLOW_UID=your_uid
UMLS_API=your_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_DEST_BUCKET=s3://bucket_name/folder_name
AWS_REGION=your_aws_region
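
Below is a minimal sketch of the export pattern described above, not the actual export_marts task code: the mart name, local exports path, and helper function are illustrative assumptions. It shows pandas writing a Parquet file locally and, when a destination bucket is configured, writing directly to S3 via s3fs, with AWS credentials picked up from the environment variables named in the example .env file.

# Illustrative sketch only -- not the actual export_marts task code.
import os
from pathlib import Path

import pandas as pd


def export_mart(df: pd.DataFrame, mart_name: str) -> None:
    # Always write a local copy under the exports folder (path is illustrative).
    local_dir = Path("/opt/airflow/exports")
    local_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(local_dir / f"{mart_name}.parquet")
    # df.to_csv(local_dir / f"{mart_name}.csv", index=False)  # CSV alternative

    # If a destination bucket URI is set, also write straight to S3. With s3fs
    # installed, pandas accepts s3:// URIs directly, and the underlying
    # botocore session reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from
    # the environment -- no explicit credential handling needed.
    dest_bucket = os.environ.get("AWS_DEST_BUCKET")  # e.g. s3://bucket_name/folder_name
    if dest_bucket:
        df.to_parquet(f"{dest_bucket}/{mart_name}.parquet")
        # df.to_csv(f"{dest_bucket}/{mart_name}.csv", index=False)  # CSV alternative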

Tests

Successfully ran the export_marts DAG in Airflow and confirmed the files were written properly to both the local exports directory and the S3 bucket.

coussens · Jan 15 '25