cloudtrail-parquet-glue
Initialization help
In order to use this project, is it sufficient to create the 3 new S3 buckets, fill in the values in the `variables.tf` file, and then run `terraform apply`? That is what I did, and I am not having success. The terraform command ran successfully and output `glue_workflow_id = CloudTrailParquetGlue`. I can see I have an Athena database named `cloudtrail` and a table named after the S3 bucket that contains my original CloudTrail logs, `raw_CLOUDTRAIL_BUCKET` (where `CLOUDTRAIL_BUCKET` is the name of my bucket). I have two Glue crawlers, `CloudTrailParquetCrawler` and `CloudTrailRawCrawler`, which I ran manually; judging by their logs, both completed successfully.
I then ran the ETL job `CloudTrailToParquet` and got the following error message:

```
AnalysisException: u'Partition column `day` not found in schema
```

My `parquet` and `temp` S3 buckets are empty.
I am running this against an S3 bucket that I copied CloudTrail logs into, so they retain their original directory structure. My guess is that the partitions are being created with names like `partition_0` instead of `day`, due to this line: https://github.com/alsmola/cloudtrail-parquet-glue/blob/91adbe4aff8c815d7e79a64a3a17b8d116e8761f/scripts/glue_etl.py#L15
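For context on that guess: CloudTrail delivers logs under a path like `AWSLogs/<account-id>/CloudTrail/<region>/<year>/<month>/<day>/`, which contains no `key=value` segments for the crawler to take column names from, so as I understand it the crawler falls back to generic `partition_N` names in path order. A minimal pure-Python illustration, with a made-up sample key (this is not Glue's actual code):

```python
# Illustration only (not Glue's real implementation): when partition
# directories carry no key=value hints, the crawler defaults the column
# names to partition_0, partition_1, ... in path order.
# The account ID and key below are hypothetical.
key = "AWSLogs/123456789012/CloudTrail/us-east-1/2020/01/31/trail.json.gz"

segments = key.split("/")[:-1]  # drop the object name, keep the directories
partitions = {f"partition_{i}": seg for i, seg in enumerate(segments)}

print(partitions)  # partition_0 .. partition_6; no column is named "day"
```

So the raw table ends up with columns `partition_0` through `partition_6`, and when the ETL job later asks for a partition column named `day`, nothing in the schema matches.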
I tried changing the `glue_etl.py` file to use names like `account` instead of `partition_2`, etc., but this only resulted in a new error: `AnalysisException: u'Partition column account not found in schema`. Also, the Athena table still looks the same, with those `partition_0` column names.
@0xdabbad00 You might need to recreate the table in order to re-partition it, or use the schema-update configuration in the crawler properties. If you're still in the development phase, it might be easier to run `terraform destroy` and then `terraform apply` again.
Any further comment on what's causing the reported issue? It seems like even cursory testing of the deployment in this repo should reveal whether or not this is a flaw.
I've been troubleshooting this a bit and was successful after manually editing the partition names in the Glue console to `awslogs`, `account`, `cloudtrail`, `region`, `year`, `month`, `day`. I haven't been successful with any edits in the Terraform module, but I'm still playing around.
On line 15 of `glue_etl.py`, you have to change the field mappings so that `"partition_0"` maps to `"awslogs"`, etc. Once I made the change below, the entire workflow completed successfully.
```python
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("eventversion", "string", "eventversion", "string"),
        ("useridentity", "struct", "useridentity", "struct"),
        ("eventtime", "string", "eventtime", "string"),
        ("eventsource", "string", "eventsource", "string"),
        ("eventname", "string", "eventname", "string"),
        ("awsregion", "string", "awsregion", "string"),
        ("sourceipaddress", "string", "sourceipaddress", "string"),
        ("useragent", "string", "useragent", "string"),
        ("requestparameters", "struct", "requestparameters", "string"),
        ("responseelements", "struct", "responseelements", "string"),
        ("requestid", "string", "requestid", "string"),
        ("eventid", "string", "eventid", "string"),
        ("eventtype", "string", "eventtype", "string"),
        ("recipientaccountid", "string", "recipientaccountid", "string"),
        ("resources", "array", "resources", "array"),
        ("sharedeventid", "string", "sharedeventid", "string"),
        ("errorcode", "string", "errorcode", "string"),
        ("errormessage", "string", "errormessage", "string"),
        ("apiversion", "string", "apiversion", "string"),
        ("readonly", "boolean", "readonly", "boolean"),
        ("additionaleventdata", "struct", "additionaleventdata", "string"),
        ("vpcendpointid", "string", "vpcendpointid", "string"),
        ("managementevent", "boolean", "managementevent", "boolean"),
        ("eventcategory", "string", "eventcategory", "string"),
        ("serviceeventdetails", "struct", "serviceeventdetails", "struct"),
        # Rename the crawler's default partition columns to meaningful names:
        ("partition_0", "string", "awslogs", "string"),
        ("partition_1", "string", "account", "string"),
        ("partition_2", "string", "cloudtrail", "string"),
        ("partition_3", "string", "region", "string"),
        ("partition_4", "string", "year", "string"),
        ("partition_5", "string", "month", "string"),
        ("partition_6", "string", "day", "string"),
    ],
    transformation_ctx="applymapping1",
)
```
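If you'd rather not hand-edit that long tuple list, here is a sketch of applying the same renames programmatically. The `rename_partitions` helper and the sample `mappings` list are my own illustration, not part of the repo; the target names follow the ones used above:

```python
# Sketch (not repo code): rewrite only the target column name of each
# partition_N entry in an ApplyMapping-style mappings list, leaving all
# other field mappings untouched.
PARTITION_RENAMES = {
    "partition_0": "awslogs",
    "partition_1": "account",
    "partition_2": "cloudtrail",
    "partition_3": "region",
    "partition_4": "year",
    "partition_5": "month",
    "partition_6": "day",
}

def rename_partitions(mappings):
    """Return a mappings list with partition_N target names renamed."""
    return [
        (src, src_type, PARTITION_RENAMES.get(src, dst), dst_type)
        for src, src_type, dst, dst_type in mappings
    ]

# Tiny sample; the real list in glue_etl.py has many more entries.
mappings = [
    ("eventname", "string", "eventname", "string"),
    ("partition_6", "string", "partition_6", "string"),
]
print(rename_partitions(mappings))  # partition_6's target becomes "day"
```

This keeps the rename table in one place, so a future change to the directory layout only touches the dict.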
Just submitted a pull request that should solve this issue.