aws-sdk-pandas
`DescribeTable` call hidden in `read_items`
I'm using `read_items`, but inside that call is another API call. Can this be avoided?
The call chain is:
https://github.com/aws/aws-sdk-pandas/blob/6c0f65b6b63b223bec1059ecd037697b068f7e63/awswrangler/dynamodb/_read.py#L617
to:
https://github.com/aws/aws-sdk-pandas/blob/6c0f65b6b63b223bec1059ecd037697b068f7e63/awswrangler/dynamodb/_utils.py#L22
to:
https://github.com/aws/aws-sdk-pandas/blob/6c0f65b6b63b223bec1059ecd037697b068f7e63/awswrangler/dynamodb/_utils.py#L42
When I enable debug logging (`boto3.set_stream_logger(name='botocore')`) and call `table.key_schema`, I see:
2023-07-18 09:19:42,633 botocore.hooks [DEBUG] Event request-created.dynamodb.DescribeTable: calling handler <function add_retry_headers at 0x1a316ec20>
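A minimal way to reproduce this (a sketch; "my-table" is a placeholder name):

```python
import boto3

boto3.set_stream_logger(name="botocore")  # send botocore debug logs to stderr

table = boto3.resource("dynamodb").Table("my-table")
# Accessing a lazy-loaded attribute forces the resource to load itself,
# which issues the DescribeTable call seen in the log above.
table.key_schema
```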
This means every `read_items` call is doing this extra lookup. Could an option be added to provide the `key_schema` explicitly? Only so many items can be provided to `read_items` (100 max), so it's guaranteed that anyone using this library will make this extra call on 1% of their requests; if they're doing massive read traffic in the millions, this adds up.
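A sketch of what the requested option could look like; the `key_schema` parameter below is hypothetical and does not exist in the current API:

```python
import awswrangler as wr

# Hypothetical: passing the key schema up front would let read_items
# skip the DescribeTable lookup entirely.
df = wr.dynamodb.read_items(
    table_name="my-table",
    partition_values=["id-1", "id-2"],
    key_schema=[{"AttributeName": "pk", "KeyType": "HASH"}],  # not a real parameter today
)
```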
Hey,
More than 100 items can be returned by `read_items`, so this extra call to `DescribeTable` shouldn't happen that frequently.
However, I do see the case for allowing a customer to provide `key_schema` manually. I will discuss this with the rest of the team.
Best regards, Leon
@mooreniemi can you expand on this please:

> Only so many items can be provided to read_items (100 max)

as Leon mentioned, `read_items` is not limited in size and returns as much as the call reads.
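For example (a sketch, assuming a table named "my-table"), a scan-based read paginates internally and returns every item, however many there are:

```python
import awswrangler as wr

# A full scan is not bounded by any per-call item count;
# the library pages through the whole table.
df = wr.dynamodb.read_items(table_name="my-table", allow_full_scan=True)
```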
Jumping in just to (hopefully) clarify what @mooreniemi is referring to.
The usage of `wr.dynamodb.read_items` is actually bounded by the intrinsic limit of the underlying boto3 client (the `BatchGetItem` limit) when explicitly provided with a list of more than 100 partition/sort values, as stated here:

> A single operation can retrieve up to 16 MB of data, which can contain as many as 100 items.
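The limit is easy to see against the raw boto3 client (a sketch; "my-table" and the "pk" key are placeholders):

```python
import boto3

client = boto3.client("dynamodb")

# 101 keys in a single BatchGetItem request exceeds the 100-item ceiling
# and fails with a ClientError (ValidationException) before reading anything.
client.batch_get_item(
    RequestItems={
        "my-table": {
            "Keys": [{"pk": {"S": f"id-{i}"}} for i in range(101)],
        }
    }
)
```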
A possible workaround is a sort of manual client-side pagination, collecting the chunked reads and concatenating them at the end (this might also be addressed directly in aws-sdk-pandas itself), something like:
```python
import awswrangler as wr
import pandas as pd

CHUNK_SIZE = 100  # BatchGetItem reads at most 100 items per call

table_name = "my-table"   # placeholder
partition_values = ...    # list with more than 100 items

frames = []
# Page through the partition values in chunks that respect the 100-item limit.
for start in range(0, len(partition_values), CHUNK_SIZE):
    chunk = partition_values[start : start + CHUNK_SIZE]
    frames.append(
        wr.dynamodb.read_items(
            table_name=table_name,
            partition_values=chunk,
        )
    )
items = pd.concat(frames, ignore_index=True)
```
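Each chunk stays within the `BatchGetItem` ceiling, so no single underlying call can hit the 100-item limit; if this were folded into aws-sdk-pandas, the same chunking would simply live inside `read_items`.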