dataall icon indicating copy to clipboard operation
dataall copied to clipboard

Dataset import fails with "Database already exists"

Open voidwisp opened this issue 1 year ago • 1 comments

Describe the bug

We had a user who tried importing a glue database which is encrypted. When data.all creates the dataset stack it runs a custom lambda resource for the glue database.

This lambda had the following exception inside the log: /aws/lambda/dataall-gluedb-lf-handler-aztj3kw0

[DEBUG]	2024-07-29T04:52:55.843Z	c4ed10d5-ebae-4b7c-8d89-0f4a3f687f99	Response body:
b'
{
    "__type": "GlueEncryptionException",
    "Message": "User: arn:aws:sts::REDACTED:assumed-role/dataallPivotRole-cdk/dataall-gluedb-lf-handler-REDACTED is not authorized to perform: kms:Decrypt on resource: arn:aws:kms:us-east-1:REDACTED:key/REDACTED because no identity-based policy allows the kms:Decrypt action (Service: AWSKMS; Status Code: 400; Error Code: AccessDeniedException; Request ID: REDACTED; Proxy: null)"
}
'

Because of this exception lambda assumes database does not exist and tries to create it.

How to Reproduce

Encrypt a glue DB with KMS CMK key and attempt to import a dataset with it.

Expected behavior

When checking if a glue database exists in a custom resource CF lambda fails then the lambda should exit with a correct failure and not attempt to create a database.

Your project

No response

Screenshots

No response

OS

N/A

Python version

N/A

AWS data.all version

2.6

Additional context

No response

voidwisp avatar Jul 29 '24 12:07 voidwisp

Thanks for opening this issue @zsaltys - looking at /backend/dataall/modules/s3_datasets_shares/aws/glue_client.py::get_glue_database (method used by Custom Resource checks if DB exists before creating or not) we have the following

        try:
            log.info(f'Getting database {self._database} ' f'in account {self._account_id}...')
            database = self._client.get_database(CatalogId=self._account_id, Name=self._database)
            return database
        except ClientError:
            log.info(f'Database {self._database} not found in account {self._account_id}')
            return False

It is a good catch - we should be handling this exception more granularly to avoid such circumstances as the one described above where we default assume DB exist on any exception

We will continue looking into this issue and see how best to split up error handling to prevent this scenario from repeatedly happening

noah-paige avatar Aug 02 '24 22:08 noah-paige