augment and streamline export functionality
Resolves #338
Explanation
- Created a local `exports` folder within Airflow (and updated `.gitignore` and `docker_compose.yml` accordingly) so that data marts are exported locally, in addition to S3 (if a destination bucket URI is specified).
- Added the `s3fs` library to `requirements.txt` to allow for more streamlined writing of files to S3 (described below).
- Leveraged the built-in functionality of `pandas` (when `s3fs` is installed) to write files (CSV, Parquet, etc.) directly from a DataFrame to an S3 bucket, with no need to write to a local file and then copy it to the bucket. The `export_marts` DAG writes to `.parquet` format by default, but I left commented-out lines for writing to CSV for easy editing, if desired (see the sketch after the `.env` example below).
- Leveraged the default use of environment variables by `pandas` for AWS authentication, so there is no need to explicitly pull the keys from the environment and pass them to boto/pandas/etc.
- Amended the `docker_compose.yml` file to remove unnecessary references to AWS environment variables. Simply naming them appropriately in the `.env` file (example below) is all that's required.
```
# example .env file
AIRFLOW_UID=your_uid
UMLS_API=your_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_DEST_BUCKET=s3://bucket_name/folder_name
AWS_REGION=your_aws_region
```
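For illustration, here is a minimal sketch of the direct-to-S3 write pattern described above. The DataFrame contents and the `example_mart` file name are hypothetical (the actual DAG code may differ), and it assumes `s3fs` and a Parquet engine such as `pyarrow` are installed, with `AWS_DEST_BUCKET` set as in the `.env` example:

```python
import os

import pandas as pd

# Hypothetical mart data; in the real DAG this would come from the database.
df = pd.DataFrame({"ndc": ["0002-1433-80"], "description": ["example row"]})

# Local export into the exports/ folder added by this PR.
df.to_parquet("exports/example_mart.parquet")
# df.to_csv("exports/example_mart.csv", index=False)  # commented-out CSV option

# Direct-to-S3 export, only when a destination bucket URI is configured.
# With s3fs installed, pandas accepts s3:// paths natively, and the
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables are
# picked up automatically for authentication -- no explicit key handling.
dest_bucket = os.environ.get("AWS_DEST_BUCKET")
if dest_bucket:
    df.to_parquet(f"{dest_bucket}/example_mart.parquet")
    # df.to_csv(f"{dest_bucket}/example_mart.csv", index=False)
```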
Tests
Successfully ran the `export_marts` DAG in Airflow and confirmed the files were written properly to both the local `exports` directory and the S3 bucket.
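As a rough illustration, outputs like these can be spot-checked by reading both copies back and comparing them (file name hypothetical, with `AWS_DEST_BUCKET` set as in the `.env` example above):

```python
import os

import pandas as pd

local_df = pd.read_parquet("exports/example_mart.parquet")
s3_df = pd.read_parquet(f"{os.environ['AWS_DEST_BUCKET']}/example_mart.parquet")
assert local_df.equals(s3_df), "local and S3 exports should be identical"
```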