`DescribeTable` call hidden in `read_items`

Open mooreniemi opened this issue 2 years ago • 3 comments

I'm using read_items but inside that call is another API call. Can this be avoided?

https://github.com/aws/aws-sdk-pandas/blob/6c0f65b6b63b223bec1059ecd037697b068f7e63/awswrangler/dynamodb/_read.py#L617

To:

https://github.com/aws/aws-sdk-pandas/blob/6c0f65b6b63b223bec1059ecd037697b068f7e63/awswrangler/dynamodb/_utils.py#L22

To:

https://github.com/aws/aws-sdk-pandas/blob/6c0f65b6b63b223bec1059ecd037697b068f7e63/awswrangler/dynamodb/_utils.py#L42

When I enable debug logging (boto3.set_stream_logger(name='botocore')) and call table.key_schema, I see:

2023-07-18 09:19:42,633 botocore.hooks [DEBUG] Event request-created.dynamodb.DescribeTable: calling handler <function add_retry_headers at 0x1a316ec20>

This means every read_items call does this extra lookup. Could an option be added to provide the key_schema explicitly? Only so many items can be provided to read_items (100 max), so anyone using this library is guaranteed to make this extra call on about 1% of their requests. For massive read traffic in the millions, this adds up.
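Until such an option exists, a caller who already knows the key attribute names could sidestep the lookup by issuing the batch read directly. The following is only a minimal sketch, not part of aws-sdk-pandas: `batch_get` is a hypothetical helper, and `dynamodb` is assumed to behave like `boto3.resource("dynamodb")`, whose `batch_get_item` accepts `RequestItems` and reports leftovers under `UnprocessedKeys`.

```python
from typing import Any, Dict, List

CHUNK_SIZE = 100  # a single batch read can contain at most 100 items


def batch_get(dynamodb: Any, table_name: str, keys: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fetch items by primary key without a DescribeTable lookup.

    `dynamodb` is assumed to behave like boto3.resource("dynamodb").
    `keys` must already use the table's key attribute names,
    e.g. [{"pk": "a"}, {"pk": "b"}] ("pk" is a placeholder name).
    """
    items: List[Dict[str, Any]] = []
    for start in range(0, len(keys), CHUNK_SIZE):
        request = {table_name: {"Keys": keys[start : start + CHUNK_SIZE]}}
        # DynamoDB may return only part of a batch; resend whatever is
        # reported under UnprocessedKeys until nothing is left.
        while request:
            response = dynamodb.batch_get_item(RequestItems=request)
            items.extend(response.get("Responses", {}).get(table_name, []))
            request = response.get("UnprocessedKeys") or {}
    return items
```

A call might look like `batch_get(boto3.resource("dynamodb"), "my-table", [{"pk": v} for v in values])`, where the table and attribute names are placeholders. The result is a list of plain dicts rather than a DataFrame, and production code would likely add a backoff between retries of unprocessed keys.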

mooreniemi avatar Jul 18 '23 13:07 mooreniemi

Hey,

More than 100 items can be returned by read_items, so this extra call to DescribeTable shouldn't happen that frequently.

However, I do see the case for allowing a customer to provide key_schema manually. I will discuss this with the rest of the team.

Best regards, Leon

LeonLuttenberger avatar Jul 18 '23 16:07 LeonLuttenberger

@mooreniemi can you expand on this please:

Only so many items can be provided to read_items (100 max)

As Leon mentioned, read_items is not limited in size and returns as many items as the call reads.

jaidisido avatar Jul 19 '23 09:07 jaidisido

Jumping in just to (hopefully) clarify what @mooreniemi is referring to.

The usage of wr.dynamodb.read_items is bounded by the intrinsic limit of the underlying boto3 client if and only if it is explicitly given a list of partition/sort values with more than 100 items, as stated in the DynamoDB BatchGetItem documentation:

A single operation can retrieve up to 16 MB of data, which can contain as many as 100 items.
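To put numbers on that limit, here is a rough back-of-the-envelope sketch. It assumes the keys are split into chunks of at most 100 and that each chunk goes through its own read_items call, which in turn triggers one DescribeTable lookup as seen in the debug log above. `extra_describe_calls` is a hypothetical helper name.

```python
import math

BATCH_LIMIT = 100  # maximum items per batch read, per the quote above


def extra_describe_calls(num_keys: int, batch_limit: int = BATCH_LIMIT) -> int:
    """Number of chunked read_items calls, and so of DescribeTable lookups,
    assuming one lookup per read_items call."""
    return math.ceil(num_keys / batch_limit)


# One million keys would need 10,000 chunked reads,
# and therefore 10,000 additional DescribeTable calls.
print(extra_describe_calls(1_000_000))
```

Under these assumptions the overhead is one extra API call per 100 items read.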

A possible workaround is to implement a form of manual pagination, which could also be addressed directly in aws-sdk-pandas itself. Something like:

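import awswrangler as wr
import pandas as pd

CHUNK_SIZE = 100  # a single batch read returns at most 100 items

table_name = ...  # name of the DynamoDB table
partition_values = ...  # list with more than 100 items

# read_items returns a DataFrame by default, so collect one frame per
# chunk and concatenate them, rather than extending a plain list
# (which would only pick up the column names).
frames = []
for start in range(0, len(partition_values), CHUNK_SIZE):
    frames.append(
        wr.dynamodb.read_items(
            table_name=table_name,
            partition_values=partition_values[start : start + CHUNK_SIZE],
        )
    )

items = pd.concat(frames, ignore_index=True)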

a-slice-of-py avatar Jul 28 '23 10:07 a-slice-of-py