botocore icon indicating copy to clipboard operation
botocore copied to clipboard

Enable ExclusiveStartKey = None for first call to table.Query

Open KeithCu opened this issue 5 years ago • 19 comments

When making a query against DynamoDB which gives back results in batches, for the 2nd and later calls, you are supposed to pass in the ExclusiveStartKey set to response['LastEvaluatedKey'] from the last batch to have the query continue for the next batch. For the first batch, it would be nice to be able to pass in ExclusiveStartKey = None.

However the AWS (Python API) doesn't support that option and it gives an error message about an invalid ExclusiveStartKey.

(File "/usr/lib/python3.7/site-packages/botocore/validate.py", line 297, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid type for parameter ExclusiveStartKey, value: None, type: <class 'NoneType'>, valid types: <class 'dict'>)

I also tried passing in an empty dict but it didn't like that either.

It would be nice if it could just silently ignore that value if set to None, because it won't be properly set till the 2nd and later calls. Otherwise it seems I have to write separate calls to Query with and without that one parameter and it makes the code more complicated.

Here is some simple logic (that should work, but fails currently):

startKey = None
while True:
    response = table.query(ExclusiveStartKey = startkey, KeyConditionExpression=XXX)
    #Process batch
    startkey = response.get('LastEvaluatedKey', None)
    startkey is None:
        break

KeithCu avatar Mar 01 '19 08:03 KeithCu

None could be semantically important, so if we allowed providing it then it could lead to breaking changes down the road.

You can also use paginators. They're not super well integrated into the Table resource, but you could do:

paginator = table.meta.client.get_paginator('query')
page_iterator = paginator.paginate(TableName='table', KeyConditionExpression=XXX)
for page in page_iterator:
    pass

JordonPhillips avatar Mar 05 '19 19:03 JordonPhillips

I didn't consider None as being a valid ExclusiveStartKey. I could see searching for None values in the condition expression. Will both be needed? I believe the alternative is something like this. You would want to put your process batch logic into a function to prevent copy and paste code.

    response = table.query(KeyConditionExpression=key)
    startkey = response.get('LastEvaluatedKey', None)
    ProcessBatch(response)

    while startkey is not None:
        response = table.query(ExclusiveStartKey = startkey, KeyConditionExpression=key)
        startkey = response.get('LastEvaluatedKey', None)
        ProcessBatch(response)

You could also write something like this:

    response = table.query(KeyConditionExpression=key)
    first = True

    while first or startkey is not None:
        if not first:
            response = table.query(ExclusiveStartKey = startkey, KeyConditionExpression=key)

        first = False
        startkey = response.get('LastEvaluatedKey', None)
        ProcessBatch(response)

It gets a little convoluted for a simple scenario. Ideally this API can be easier than SQL.

KeithCu avatar Mar 06 '19 03:03 KeithCu

Not sure im following why you have to do all that, why doesn't phjordon's suggestion work? just replace pass with ProcessBatch(page).

stealthycoin avatar Jul 10 '19 23:07 stealthycoin

That workaround of using the paginators API could work, but this bug is about having a more practical, easy to use API for Table.Query. I presume it is an important API. Also, no one has explained or perhaps thought about why ExclusiveStartKey set to None can't be a signal to start from the beginning. What other meaning could that have? Maybe that issue should be decided, and then people can figure out what to do about this bug.

KeithCu avatar Jul 11 '19 00:07 KeithCu

I agree 100% with the OP. The error message even says that the only valid input is a dict, so why not allow None as a sentinel to eliminate the need for two (nearly identical) queries?

rhiller avatar Jul 16 '19 21:07 rhiller

I agree with OP as well, I was looking for a way to do exactly what he suggests, and found this open issue.

joshorr avatar Aug 06 '19 14:08 joshorr

I also agree that None should be an accepted option for ExclusiveStartKey. Doing so would eliminate a bunch of boilerplate code across many projects, sounds like.

aghalbert avatar Sep 16 '19 16:09 aghalbert

Running across the same issue here and agree with OP's solution. My immediate inclination was to code out something exactly along the lines of the example in the OP's original post. Working with the supplied paginators seems to require using the low-level clients and produces a different result format.

WongPlaceRightTime avatar Oct 02 '19 19:10 WongPlaceRightTime

Also having this issue and would love to be able to pass in None, or an empty dictionary or something.

norrishd avatar Feb 05 '20 03:02 norrishd

+1 - passing in an empty dict at the very least seems like proper API here.

mafellows avatar Feb 15 '20 18:02 mafellows

+1 for the paginators approach from @JordonPhillips. Here is a more complete example

dbClient = boto3.client('dynamodb')
queryPgr = dbClient.get_paginator('query')

page_iterator = queryPgr.paginate(
    TableName = 'endpoints',
    KeyConditionExpression = 'userid = :u',
    ExpressionAttributeValues = {
        ':u': {'S': 'whatever'}
    }
)

results = []
for page in page_iterator:
    results.extend(page['Items'])

print(results)

EDIT: on second thought, the above way results in having to deal with the serializer from boto3.dynamodb.types. To avoid that, I ended up making a query_all function to deal with this like so:

dbResource = boto3.resource('dynamodb')
endpointsTable = dbResource.Table('endpoints')

def query_all(table, query):
    to_return = []
    try:
        q = table.query(**query)
        to_return.extend(q['Items'])
        while 'LastEvaluatedKey' in q:
            query['ExclusiveStartKey'] = q['LastEvaluatedKey']
            q = table.query(**query)
            to_return.extend(q['Items'])
    except:
        traceback.print_exc()
    return to_return

result = query_all(endpointsTable, {
    'KeyConditionExpression': 'userid = :u',
    'ExpressionAttributeValues': {
        ':u': 'whatever'
    }
})

print(result)

notice now we are dealing with boto3.resource rather than boto3.client which makes life easier since it handles the typing automatically.

grintor avatar Feb 16 '20 19:02 grintor

Adding ability to pass None as the start key should be no brainer... Would simplify the query code. In general boto3 code is much less user friendly than for example pymongo. Programing for mongo is nice experience while programing for dynamodb is royal pain in the butt... :(

ccie18643 avatar Mar 06 '20 15:03 ccie18643

Another example is 'Limit'... Why limit equal to 0 could not mean 'no limit' ? Programmers life would be too easy then i guess...

ccie18643 avatar Mar 06 '20 15:03 ccie18643

I agree that being able to pass a semaphore (like None) as the initial ExclusiveStartKey would help new boto users avoid code duplication. However, you can use Python's **kwargs feature to achieve the same result:

dynamodb = boto3.resource( 'dynamodb' )
table = dynamodb.Table( 'MyTable' )
kwargs = {
    'KeyConditionExpression': Key( 'myKey' ).eq( myKey ),
    'FilterExpression': Attr( 'myAttr' ).eq( myAttr )
}
while True:
    response = table.query( **kwargs )
    processItems( response[ u'Items' ] )
    if 'LastEvaluatedKey' in response:
        kwargs[ 'ExclusiveStartKey' ] = response[ 'LastEvaluatedKey' ]
    else:
        break

PatonLewis avatar Mar 16 '20 18:03 PatonLewis

After all of this time, I understand people don't want to break their existing coding flow, but paginators seem to be really nice, and @grintor 's example worked well for me.

It is really simple, and seems clean because there is only a single line in the for loop, instead of the entire query.

kornpow avatar Oct 15 '21 16:10 kornpow

If botocore insists None should be reserved, I would recommend to provide some botocore native constant value (or a customer configurable constant when creating the client) that represents the ignorable values. For example:

class IGNORABLE:
    def __bool__(self):
        return False

Then when processing API arguments, like here, botocore client can make a deep copy of the arguments to exclude the ignorable values.

There will be some trade-off for deep copy:

  1. If do not do deep copy, but directly remove the argument kwargs, then the argument is mutated, which will surprises users if the argument is reused.
  2. If do deep copy, then it is not efficient especially when the argument takes huge memory.
  3. A trade off between 1 & 3 is to do deep copy only when the ignorable constant exists in the argument, or when users explicitly set some flag in the client.

kaiwensun avatar Jul 15 '22 05:07 kaiwensun

Enable ExclusiveStartKey = None for first call to table.Query

+1, accept None value make this parameter human facing. Otherwise the code will look like shit when using ExclusiveStartKey:

        if nextToken:
            response = table.query(TableName=TABLE_NAME,
                                   Select='ALL_ATTRIBUTES',
                                   Limit=limit,
                                   ExclusiveStartKey=nextToken,
                                   KeyConditionExpression=Key('id').eq(id) & Key('time').between(startTime, endTime))
        else:
            response = table.query(TableName=TABLE_NAME,
                                   Select='ALL_ATTRIBUTES',
                                   Limit=limit,
                                   KeyConditionExpression=Key('id').eq(id) & Key('time').between(startTime, endTime))

Anyway, parameter design should consider the users, regardless of whether they are reserved or not. There is no need to provide elaborate justifications, simply make the necessary changes.

ufosaga avatar Jun 29 '23 15:06 ufosaga

If using a paginator is the way to do pagination, then why do we have the LEK/ESK stuff at all? Using it needlessly complicates what should be very simple code, leading to duplication and potential bugs.

vin avatar Oct 17 '23 16:10 vin

+1 This is just ridiculous. Here is my workaround with iterators:

def iterate_query(table: dynamo.Table, **query_params):
    response = None
    page_params = {}

    while not response or page_params.get("ExclusiveStartKey"):
        response = table.query(**query_params, **page_params)
        page_params["ExclusiveStartKey"] = response.get("LastEvaluatedKey")
        yield response

The function can be used like this:

for page in iterate_query(table, **query_params):
    items = page["Items"]
    ...

Personally, I don't really care about ExclusiveStartKey. There should be a straightforward pythonic way how to do this without thinking about it too much. I am sorry, but paginators look like something you would need in Java. Ideally, I would like to call something like this:


for page in table.iter_query(**my_query_params):
    items = page["Items"]
    ...

poncovka avatar Apr 11 '24 07:04 poncovka