botocore
botocore copied to clipboard
Enable ExclusiveStartKey = None for first call to table.Query
When making a query against DynamoDB which gives back results in batches, for the 2nd and later calls, you are supposed to pass in the ExclusiveStartKey set to response['LastEvaluatedKey'] from the last batch to have the query continue for the next batch. For the first batch, it would be nice to be able to pass in ExclusiveStartKey = None.
However the AWS (Python API) doesn't support that option and it gives an error message about an invalid ExclusiveStartKey.
(File "/usr/lib/python3.7/site-packages/botocore/validate.py", line 297, in serialize_to_request
raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid type for parameter ExclusiveStartKey, value: None, type: <class 'NoneType'>, valid types: <class 'dict'>)
I also tried passing in an empty dict but it didn't like that either.
It would be nice if it could just silently ignore that value if set to None, because it won't be properly set till the 2nd and later calls. Otherwise it seems I have to write separate calls to Query with and without that one parameter and it makes the code more complicated.
Here is some simple logic (that should work, but fails currently):
startKey = None
while True:
response = table.query(ExclusiveStartKey = startkey, KeyConditionExpression=XXX)
#Process batch
startkey = response.get('LastEvaluatedKey', None)
startkey is None:
break
None
could be semantically important, so if we allowed providing it then it could lead to breaking changes down the road.
You can also use paginators. They're not super well integrated into the Table resource, but you could do:
paginator = table.meta.client.get_paginator('query')
page_iterator = paginator.paginate(TableName='table', KeyConditionExpression=XXX)
for page in page_iterator:
pass
I didn't consider None as being a valid ExclusiveStartKey. I could see searching for None values in the condition expression. Will both be needed? I believe the alternative is something like this. You would want to put your process batch logic into a function to prevent copy and paste code.
response = table.query(KeyConditionExpression=key)
startkey = response.get('LastEvaluatedKey', None)
ProcessBatch(response)
while startkey is not None:
response = table.query(ExclusiveStartKey = startkey, KeyConditionExpression=key)
startkey = response.get('LastEvaluatedKey', None)
ProcessBatch(response)
You could also write something like this:
response = table.query(KeyConditionExpression=key)
first = True
while first or startkey is not None:
if not first:
response = table.query(ExclusiveStartKey = startkey, KeyConditionExpression=key)
first = False
startkey = response.get('LastEvaluatedKey', None)
ProcessBatch(response)
It gets a little convoluted for a simple scenario. Ideally this API can be easier than SQL.
Not sure im following why you have to do all that, why doesn't phjordon's suggestion work? just replace pass with ProcessBatch(page).
That workaround of using the paginators API could work, but this bug is about having a more practical, easy to use API for Table.Query. I presume it is an important API. Also, no one has explained or perhaps thought about why ExclusiveStartKey set to None can't be a signal to start from the beginning. What other meaning could that have? Maybe that issue should be decided, and then people can figure out what to do about this bug.
I agree 100% with the OP. The error message even says that the only valid input is a dict, so why not allow None as a sentinel to eliminate the need for two (nearly identical) queries?
I agree with OP as well, I was looking for a way to do exactly what he suggests, and found this open issue.
I also agree that None should be an accepted option for ExclusiveStartKey. Doing so would eliminate a bunch of boilerplate code across many projects, sounds like.
Running across the same issue here and agree with OP's solution. My immediate inclination was to code out something exactly along the lines of the example in the OP's original post. Working with the supplied paginators seems to require using the low-level clients and produces a different result format.
Also having this issue and would love to be able to pass in None
, or an empty dictionary or something.
+1 - passing in an empty dict at the very least seems like proper API here.
+1 for the paginators approach from @JordonPhillips. Here is a more complete example
dbClient = boto3.client('dynamodb')
queryPgr = dbClient.get_paginator('query')
page_iterator = queryPgr.paginate(
TableName = 'endpoints',
KeyConditionExpression = 'userid = :u',
ExpressionAttributeValues = {
':u': {'S': 'whatever'}
}
)
results = []
for page in page_iterator:
results.extend(page['Items'])
print(results)
EDIT: on second thought, the above way results in having to deal with the serializer from boto3.dynamodb.types. To avoid that, I ended up making a query_all function to deal with this like so:
dbResource = boto3.resource('dynamodb')
endpointsTable = dbResource.Table('endpoints')
def query_all(table, query):
to_return = []
try:
q = table.query(**query)
to_return.extend(q['Items'])
while 'LastEvaluatedKey' in q:
query['ExclusiveStartKey'] = q['LastEvaluatedKey']
q = table.query(**query)
to_return.extend(q['Items'])
except:
traceback.print_exc()
return to_return
result = query_all(endpointsTable, {
'KeyConditionExpression': 'userid = :u',
'ExpressionAttributeValues': {
':u': 'whatever'
}
})
print(result)
notice now we are dealing with boto3.resource rather than boto3.client which makes life easier since it handles the typing automatically.
Adding ability to pass None as the start key should be no brainer... Would simplify the query code. In general boto3 code is much less user friendly than for example pymongo. Programing for mongo is nice experience while programing for dynamodb is royal pain in the butt... :(
Another example is 'Limit'... Why limit equal to 0 could not mean 'no limit' ? Programmers life would be too easy then i guess...
I agree that being able to pass a semaphore (like None) as the initial ExclusiveStartKey would help new boto users avoid code duplication. However, you can use Python's **kwargs feature to achieve the same result:
dynamodb = boto3.resource( 'dynamodb' )
table = dynamodb.Table( 'MyTable' )
kwargs = {
'KeyConditionExpression': Key( 'myKey' ).eq( myKey ),
'FilterExpression': Attr( 'myAttr' ).eq( myAttr )
}
while True:
response = table.query( **kwargs )
processItems( response[ u'Items' ] )
if 'LastEvaluatedKey' in response:
kwargs[ 'ExclusiveStartKey' ] = response[ 'LastEvaluatedKey' ]
else:
break
After all of this time, I understand people don't want to break their existing coding flow, but paginators seem to be really nice, and @grintor 's example worked well for me.
It is really simple, and seems clean because there is only a single line in the for loop, instead of the entire query.
If botocore insists None
should be reserved, I would recommend to provide some botocore native constant value (or a customer configurable constant when creating the client) that represents the ignorable values. For example:
class IGNORABLE:
def __bool__(self):
return False
Then when processing API arguments, like here, botocore client can make a deep copy of the arguments to exclude the ignorable values.
There will be some trade-off for deep copy:
- If do not do deep copy, but directly remove the argument
kwargs
, then the argument is mutated, which will surprises users if the argument is reused. - If do deep copy, then it is not efficient especially when the argument takes huge memory.
- A trade off between 1 & 3 is to do deep copy only when the ignorable constant exists in the argument, or when users explicitly set some flag in the client.
Enable ExclusiveStartKey = None for first call to table.Query
+1, accept None value make this parameter human facing. Otherwise the code will look like shit when using ExclusiveStartKey:
if nextToken:
response = table.query(TableName=TABLE_NAME,
Select='ALL_ATTRIBUTES',
Limit=limit,
ExclusiveStartKey=nextToken,
KeyConditionExpression=Key('id').eq(id) & Key('time').between(startTime, endTime))
else:
response = table.query(TableName=TABLE_NAME,
Select='ALL_ATTRIBUTES',
Limit=limit,
KeyConditionExpression=Key('id').eq(id) & Key('time').between(startTime, endTime))
Anyway, parameter design should consider the users, regardless of whether they are reserved or not. There is no need to provide elaborate justifications, simply make the necessary changes.
If using a paginator is the way to do pagination, then why do we have the LEK/ESK stuff at all? Using it needlessly complicates what should be very simple code, leading to duplication and potential bugs.
+1 This is just ridiculous. Here is my workaround with iterators:
def iterate_query(table: dynamo.Table, **query_params):
response = None
page_params = {}
while not response or page_params.get("ExclusiveStartKey"):
response = table.query(**query_params, **page_params)
page_params["ExclusiveStartKey"] = response.get("LastEvaluatedKey")
yield response
The function can be used like this:
for page in iterate_query(table, **query_params):
items = page["Items"]
...
Personally, I don't really care about ExclusiveStartKey
. There should be a straightforward pythonic way how to do this without thinking about it too much. I am sorry, but paginators look like something you would need in Java. Ideally, I would like to call something like this:
for page in table.iter_query(**my_query_params):
items = page["Items"]
...