cloudtrail-parquet-glue

Initialization help

Open 0xdabbad00 opened this issue 4 years ago • 7 comments

In order to use this project, is it sufficient to create the 3 new S3 buckets, fill in the values in the variables.tf file, and then run terraform apply? That is what I did, and I am not having success.

The terraform command ran successfully and output glue_workflow_id = CloudTrailParquetGlue. I can see I have an Athena database named cloudtrail and a table named raw_CLOUDTRAIL_BUCKET, after the S3 bucket that contains my original CloudTrail logs (CLOUDTRAIL_BUCKET is the name of my bucket). I have two Glue crawlers, CloudTrailParquetCrawler and CloudTrailRawCrawler, which I ran manually. Both seem to have succeeded, judging by their logs.

I then ran the ETL job CloudTrailToParquet and got the following error message:

AnalysisException: u'Partition column `day` not found in schema

My parquet and temp S3 buckets are empty.

I am running this against an S3 bucket that I copied CloudTrail logs into; the copied logs retain their original directory structure.

0xdabbad00 avatar Aug 27 '20 18:08 0xdabbad00

My guess is that this happens because the partitions are being created with names like partition_0 instead of day, due to this line: https://github.com/alsmola/cloudtrail-parquet-glue/blob/91adbe4aff8c815d7e79a64a3a17b8d116e8761f/scripts/glue_etl.py#L15

0xdabbad00 avatar Aug 27 '20 18:08 0xdabbad00
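
The guess above can be illustrated with a small sketch. It is not the crawler's actual code, only an imitation of the naming behaviour described in this thread: directory segments written as key=value keep their key, and all other segments fall back to partition_N. The account ID, region, date, and file name in the example key are made up.

```python
# Sketch only: imitates the partition naming reported in this thread,
# it is not the Glue crawler's implementation.
def infer_partition_columns(key):
    """Return (column_name, value) pairs for each directory segment of an S3 key."""
    segments = key.strip("/").split("/")[:-1]  # drop the object name itself
    columns = []
    for index, segment in enumerate(segments):
        if "=" in segment:
            # Hive-style segment such as day=27 carries its own column name.
            name, value = segment.split("=", 1)
        else:
            # Plain segment: only a positional name is available.
            name, value = "partition_{}".format(index), segment
        columns.append((name, value))
    return columns


# Hypothetical key following the usual CloudTrail directory layout.
example_key = "AWSLogs/123456789012/CloudTrail/us-east-1/2020/08/27/example.json.gz"
columns = infer_partition_columns(example_key)
column_names = [name for name, _ in columns]
```

With a key like this one, none of the seven segments carries a name, so no column called day exists for the ETL job to partition on.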

I tried changing the glue_etl.py file to use names such as account instead of partition_2, but this only resulted in a new error: AnalysisException: u'Partition column account not found in schema. Also, the Athena table still looks the same, with those partition_0-style column names.

0xdabbad00 avatar Aug 27 '20 19:08 0xdabbad00

@0xdabbad00 You might need to recreate the table in order to re-partition it, or use the schema updates configuration in the crawler properties. If you're still in the development phase, it might be better to run terraform destroy and then terraform apply again.

avishayil avatar Sep 06 '20 12:09 avishayil

Any further comment on what's causing this reported issue? It seems like even cursory testing of the deployment in this repo should reveal this as a flaw (or not).

BigDataDaddy avatar Jan 08 '21 18:01 BigDataDaddy

I've been troubleshooting this a bit and was successful after manually editing the partition names in the Glue console to awslogs, account, cloudtrail, region, year, month, day. I haven't been successful with any edits in the TF module, but I'm still playing around.

andrew-kline avatar Mar 07 '21 23:03 andrew-kline
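
The manual console edit described above could, in principle, be scripted. The sketch below is an assumption, not something from this repo: it takes a table definition shaped like the "Table" value returned by boto3's Glue get_table call, renames the partition keys in order, and keeps only a subset of fields that a TableInput accepts. The helper name and the allow-list are hypothetical.

```python
import copy

# Subset of fields accepted in a Glue TableInput; read-only fields that
# get_table returns (DatabaseName, CreateTime, ...) are dropped.
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
)


def rename_partition_keys(table, new_names):
    """Return a TableInput-style dict with partition keys renamed in order."""
    keys = table.get("PartitionKeys", [])
    if len(keys) != len(new_names):
        raise ValueError(
            "expected {} names, got {}".format(len(keys), len(new_names))
        )
    # Deep-copy so the caller's table definition is left untouched.
    table_input = {
        k: copy.deepcopy(table[k]) for k in TABLE_INPUT_KEYS if k in table
    }
    for key, name in zip(table_input.get("PartitionKeys", []), new_names):
        key["Name"] = name
    return table_input


# Made-up table definition standing in for a get_table response.
sample_table = {
    "Name": "raw_example_bucket",
    "DatabaseName": "cloudtrail",
    "PartitionKeys": [
        {"Name": "partition_{}".format(i), "Type": "string"} for i in range(7)
    ],
}
renamed = rename_partition_keys(
    sample_table,
    ["awslogs", "account", "cloudtrail", "region", "year", "month", "day"],
)
```

Against a real catalog, the input would come from glue.get_table(DatabaseName=..., Name=...)["Table"] and the result would go to glue.update_table(DatabaseName=..., TableInput=renamed). Whether a later crawler run keeps the new names depends on the crawler's schema change settings, so treat this as an experiment rather than a fix.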

On line 15 of glue_etl.py, you have to change the field mappings so that "partition_0" maps to "awslogs", and so on for the other partition columns. Once I changed the line to the below, the entire workflow completed successfully.

applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [
        ("eventversion", "string", "eventversion", "string"),
        ("useridentity", "struct", "useridentity", "struct"),
        ("eventtime", "string", "eventtime", "string"),
        ("eventsource", "string", "eventsource", "string"),
        ("eventname", "string", "eventname", "string"),
        ("awsregion", "string", "awsregion", "string"),
        ("sourceipaddress", "string", "sourceipaddress", "string"),
        ("useragent", "string", "useragent", "string"),
        ("requestparameters", "struct", "requestparameters", "string"),
        ("responseelements", "struct", "responseelements", "string"),
        ("requestid", "string", "requestid", "string"),
        ("eventid", "string", "eventid", "string"),
        ("eventtype", "string", "eventtype", "string"),
        ("recipientaccountid", "string", "recipientaccountid", "string"),
        ("resources", "array", "resources", "array"),
        ("sharedeventid", "string", "sharedeventid", "string"),
        ("errorcode", "string", "errorcode", "string"),
        ("errormessage", "string", "errormessage", "string"),
        ("apiversion", "string", "apiversion", "string"),
        ("readonly", "boolean", "readonly", "boolean"),
        ("additionaleventdata", "struct", "additionaleventdata", "string"),
        ("vpcendpointid", "string", "vpcendpointid", "string"),
        ("managementevent", "boolean", "managementevent", "boolean"),
        ("eventcategory", "string", "eventcategory", "string"),
        ("serviceeventdetails", "struct", "serviceeventdetails", "struct"),
        # Crawler-assigned partition columns renamed to meaningful names.
        ("partition_0", "string", "awslogs", "string"),
        ("partition_1", "string", "account", "string"),
        ("partition_2", "string", "cloudtrail", "string"),
        ("partition_3", "string", "region", "string"),
        ("partition_4", "string", "year", "string"),
        ("partition_5", "string", "month", "string"),
        ("partition_6", "string", "day", "string"),
    ],
    transformation_ctx = "applymapping1",
)

andrew-kline avatar Mar 08 '21 12:03 andrew-kline

Just submitted a pull request that should solve this issue.

andrew-kline avatar Mar 08 '21 22:03 andrew-kline