aws-sdk-pandas Quicksight create_athena_dataset method has a default name for SQL query which breaks the functional for created datasets

At the moment awswrangler.quicksight.create_athena_dataset assigns the default name to SQL query - 'CustomSQL'. In this setup sql_name argument becomes optional, and at least I'd expect that if I don't care about sql_name as a user, the library will handle the name for me in a way that there won't be any problems.

In reality, if one creates several datasets with the same name for SQL query, it causes validation conflicts in AWS such as “Custom SQL tables cannot have the same alias but duplicates were found. All table names: \“CustomSQL\“”. It becomes very annoying because both awswrangler and QuickSight allow to create the dataset but then AWS doesn't allow some operations on it, for example joining it with other datasets. It becomes even worse when QuickSight doesn't show the reason of an error. Sometimes it just says that "something is wrong", and then one has to investigate, what's going on.

To avoid ambiguity in expectations, I see a couple of options:

Remove the default 'CustomSQL' name for the sql_name argument, and specify in documentation that it has to be unique for everything to work correctly.
Fail fast when creating dataset with duplicated sql_name and explicitly tell the user that it's not going to work.
Make sql_name "truly" not important by randomizing the 'default' name to avoid conflicts.

I see pros and cons for every option. Removing the default 'CustomSQL' name is a breaking change, if smb didn't use it before (I doubt that it's possible for extensive use but still) they'll have to update the code. Fail fast might become complex, since we'd need to check wether a SQL with the same name already exists. Also I don't see uniqueness requirement in AWS documentation, which makes the responsibilities of awswrangler bigger if we want it to handle this case. The third option is tempting, it doesn't make additional inference about QuickSight behaviour while making names slightly random and unique. But then I'm not quite sure what the best practices for this kind of randomization, and wouldn't it just make already implicit issue even more implicit.

I appreciate your thoughts, and any other suggestions. I can try to implement the change if decision will be made.

Aug 09 '22 11:08 AndreyKudryavets

Did you consider overriding wr.quicksight.create_athena_dataset(..., sql_name="my-name")? A default name is there for convenience but you still have full control over the name via sql_name parameter and can randomize / provide meaningful query names for your queries

Aug 15 '22 13:08 kukushking

well, obviously there is no problem if you don't use the default argument :)

What I'm saying is that having a default value for this argument is at least questionable, and in our case it was harmful. We relied on it, figured that it doesn't work after a few hours of work, spent some time trying to understand what hit us, and then spent some more time fixing it. I think all the hustle could have be avoided if this default wouldn't exist or if it'd be reliable.

Aug 15 '22 14:08 AndreyKudryavets

Fair enough, I would prefer option 1 but that would be breaking change indeed, so we may consider it for 3.0. Option 3 is also acceptable e.g. appending uuid to make query names unique.

Ideally though, it should be fixed on QuickSight API. If query names are indeed expected to be unique, the API should throw an exception. I will try to reach out to QuickSight team to see if they are aware of this.

Aug 15 '22 15:08 kukushking

Makes sense, thanks! Let me know if you need a hand with implementation.

Aug 15 '22 18:08 AndreyKudryavets

aws-sdk-pandas aws-sdk-pandas copied to clipboard

Quicksight create_athena_dataset method has a default name for SQL query which breaks the functional for created datasets

aws-sdk-pandas
aws-sdk-pandas copied to clipboard