aws-sdk-pandas
aws-sdk-pandas copied to clipboard
wr.dynamodb.read_partiql_query does not actually use PartiQL
Is your idea related to a problem? Please describe.
The wr.dynamodb.read_partiql_query function does not actually use PartiQL and instead does a full table scan, ignoring anything defined in the WHERE clause which is misleading.
Describe the solution you'd like
- Docs highlighting this limitation, or renaming this function to not reference
partiql - (preferably) Use the PartiQL boto3 method
execute_statement
This is something I may be able to contribute, but I'd first like to see if there's reasoning for the scan
@Dilski Thanks for opening this, I agree for PartiQL we definitely should use execute_statement instead of scan - didn't really notice this before. This also is in line with the best practices.
@Dilski Thanks for offering to contribute. If you aren't able to for any reason please unassign yourself and someone on the team can pick this up.
Thank you @malachi-constant , I'll give this a go on Friday.
If I change this function to use partiql instead of scan, the dynamodb:PartiQLSelect IAM permission will be required instead of scan. That's potentially a breaking change, so would you prefer the scan implementation is kept (with a different name) - or just replace it?
I would personally prefer a new method with an explicit and appropriate name (i.e. execute_statement)
My vote would go to making the breaking change (i.e. using execute_statement) but releasing it as part of version 3.0.0 and not 2.17
I'd lean towards breaking change for the 3.0 release, as it looks like there's a 3.0 in the works? I've been away with the flu so haven't had time to work on this yet, but will do when I can
I'll add my perspective here. I wrote a PartiQL query in the AWS Dashboard to pull a few hundred rows from my 50M row table, and dropped it into AwsWrangler expecting the same results. Instead I got a full table scan. This API is misnamed and something should prevent new users from running it against much larger datasets than mine expecting it to actually do what it says it will do.
I understand wanting to avoid breaking changes, but this API is already broken, I think?
Hi @ddrinka have you tried using the method implemented in #1824 ?
It's not in one of our release candidates for 3.x.x yet, but can be installed with pip install git+https://github.com/aws/[email protected].
@aws/aws-sdk-pandas Do we want to consider adding a warning here for this method in 2.x.x?
@malachi-constant Yes, the function works great in the 3.0.0 branch.
I just wanted to make sure y'all had the perspective from a new user of this library that I was surprised and alarmed with the current behavior. Including a note in the top level of the documentation or printing a warning if there are more characters after SELECT * FROM table would have helped me do more research first before pointing it at my data.
@ddrinka I think that's completely fair. I think we'll move forward with a warning.
After internal discussion we have decided to implement the change as a breaking change before the 3.0 release. It means it will be available in the next minor release 2.20.0 and a breaking change message regarding the IAM permission required will be mentioned in the release notes