aws-sdk-pandas icon indicating copy to clipboard operation
aws-sdk-pandas copied to clipboard

wr.dynamodb.read_partiql_query does not actually use PartiQL

Open Dilski opened this issue 3 years ago • 7 comments

Is your idea related to a problem? Please describe. The wr.dynamodb.read_partiql_query function does not actually use PartiQL and instead does a full table scan, ignoring anything defined in the WHERE clause which is misleading.

Describe the solution you'd like

  • Docs highlighting this limitation, or renaming this function to not reference partiql
  • (preferably) Use the PartiQL boto3 method execute_statement

Dilski avatar Sep 05 '22 11:09 Dilski

This is something I may be able to contribute, but I'd first like to see if there's reasoning for the scan

Dilski avatar Sep 05 '22 12:09 Dilski

@Dilski Thanks for opening this, I agree for PartiQL we definitely should use execute_statement instead of scan - didn't really notice this before. This also is in line with the best practices.

kukushking avatar Sep 07 '22 09:09 kukushking

@Dilski Thanks for offering to contribute. If you aren't able to for any reason please unassign yourself and someone on the team can pick this up.

malachi-constant avatar Sep 07 '22 14:09 malachi-constant

Thank you @malachi-constant , I'll give this a go on Friday.

If I change this function to use partiql instead of scan, the dynamodb:PartiQLSelect IAM permission will be required instead of scan. That's potentially a breaking change, so would you prefer the scan implementation is kept (with a different name) - or just replace it?

Dilski avatar Sep 07 '22 16:09 Dilski

I would personally prefer a new method with an explicit and appropriate name (i.e. execute_statement)

malachi-constant avatar Sep 07 '22 18:09 malachi-constant

My vote would go to making the breaking change (i.e. using execute_statement) but releasing it as part of version 3.0.0 and not 2.17

jaidisido avatar Sep 08 '22 14:09 jaidisido

I'd lean towards breaking change for the 3.0 release, as it looks like there's a 3.0 in the works? I've been away with the flu so haven't had time to work on this yet, but will do when I can

Dilski avatar Sep 14 '22 10:09 Dilski

I'll add my perspective here. I wrote a PartiQL query in the AWS Dashboard to pull a few hundred rows from my 50M row table, and dropped it into AwsWrangler expecting the same results. Instead I got a full table scan. This API is misnamed and something should prevent new users from running it against much larger datasets than mine expecting it to actually do what it says it will do.

I understand wanting to avoid breaking changes, but this API is already broken, I think?

ddrinka avatar Jan 24 '23 21:01 ddrinka

Hi @ddrinka have you tried using the method implemented in #1824 ?

It's not in one of our release candidates for 3.x.x yet, but can be installed with pip install git+https://github.com/aws/[email protected].

@aws/aws-sdk-pandas Do we want to consider adding a warning here for this method in 2.x.x?

malachi-constant avatar Jan 26 '23 17:01 malachi-constant

@malachi-constant Yes, the function works great in the 3.0.0 branch.

I just wanted to make sure y'all had the perspective from a new user of this library that I was surprised and alarmed with the current behavior. Including a note in the top level of the documentation or printing a warning if there are more characters after SELECT * FROM table would have helped me do more research first before pointing it at my data.

ddrinka avatar Jan 26 '23 18:01 ddrinka

@ddrinka I think that's completely fair. I think we'll move forward with a warning.

malachi-constant avatar Jan 26 '23 18:01 malachi-constant

After internal discussion we have decided to implement the change as a breaking change before the 3.0 release. It means it will be available in the next minor release 2.20.0 and a breaking change message regarding the IAM permission required will be mentioned in the release notes

jaidisido avatar Jan 26 '23 18:01 jaidisido