
Paginator Resume Token always None?

Open FrancescoRizzi opened this issue 7 years ago • 7 comments

Hi. I presume I'm misunderstanding something about the use of the paginators, because I couldn't find any pre-existing issue on this topic (only a closed PR #88 mentioning the resume_token).

So, I'm trying to use a paginator for a DynamoDB Scan and to capture, after iterating over each page, the resume_token value, so that in my 'real-world' scenario I can use it, if needed, to start a new paged iteration.

Here's a snippet:

paginator = ddb_client.get_paginator('scan')
# I won't bore you with the paginate_args, but they don't include StartingToken
page_iterator = paginator.paginate(**paginate_args)

print "Paginating over results of Scan:"
for (index, page) in enumerate(page_iterator):
    print "Page #{0!s}:".format(index)

    items = page['Items']
    print "\tNumber of Items: {0!s}".format(len(items))
    print "\tResume_Token: {0!s}".format(page_iterator.resume_token)

I'm seeing this loop over all pages available, but for each page it shows Resume_Token: None. Let's say I would like to grab the token so that, under certain circumstances (like a timeout), I can start off a new scanning operation from the spot that this scan operation reached.

Is that not supported? Am I stuck using the LastEvaluatedKey value from the page object instead?
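For reference, here is a rough sketch of the LastEvaluatedKey fallback I'm asking about. The helper name is mine, and the `client` parameter stands in for a boto3 DynamoDB client:

```python
def scan_with_resume(client, table_name, start_key=None, **scan_kwargs):
    """Yield (items, last_evaluated_key) for each page of a Scan.

    `client` is assumed to be a boto3 DynamoDB client (anything with a
    compatible scan() method works). Pass a previously captured
    last_evaluated_key back in as start_key to resume where an earlier
    scan left off.
    """
    kwargs = dict(scan_kwargs, TableName=table_name)
    if start_key is not None:
        kwargs['ExclusiveStartKey'] = start_key
    while True:
        page = client.scan(**kwargs)
        last_key = page.get('LastEvaluatedKey')
        yield page['Items'], last_key
        if last_key is None:
            return  # no LastEvaluatedKey means the scan is complete
        kwargs['ExclusiveStartKey'] = last_key
```

The only state that has to survive between runs is the last_evaluated_key dict itself.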

I'm using:

  • Python 2.7
  • Mac OS X El Capitan (10.11.4)
  • botocore 1.4.49
  • boto3 1.4.0

FrancescoRizzi avatar Sep 28 '16 17:09 FrancescoRizzi

resume_token is currently only set once MaxItems has been reached:

import boto3
from botocore.stub import Stubber


ddb = boto3.client('dynamodb')

# Setup an expected response
stubber = Stubber(ddb)
last_evaluated_key = {"data": {"S": "value"}}
service_response = {
    "ScannedCount": 1,
    "Count": 1,
    "Items": [last_evaluated_key],
    "LastEvaluatedKey": last_evaluated_key
}
stubber.add_response('scan', service_response)

# Scan for a single item, using MaxItems to make sure resume_token is set
paginator = ddb.get_paginator('scan')
pages = paginator.paginate(TableName='testing-table', PaginationConfig={
    'MaxItems': 1, 'PageSize': 1})

with stubber:
    for page in pages:
        pass

# Print paging arguments
print(page.get('LastEvaluatedKey'))
print(pages.resume_token)

For now your only option in that case is the LastEvaluatedKey, which is unfortunate.

We could set it every time, but I'm not comfortable doing that because the setter would negatively impact performance (possibly noticeably, especially in the dynamodb case).

We could also potentially set it only in error cases, so this would work:

import boto3
from botocore.exceptions import ClientError
from botocore.stub import Stubber


ddb = boto3.client('dynamodb')

# Setup an expected response
stubber = Stubber(ddb)
last_evaluated_key = {"data": {"S": "value"}}
service_response = {
    "ScannedCount": 1,
    "Count": 1,
    "Items": [last_evaluated_key],
    "LastEvaluatedKey": last_evaluated_key
}
stubber.add_response('scan', service_response)

# Setup an error
stubber.add_client_error('scan')

# Scan for two items, where the second call will throw an exception
paginator = ddb.get_paginator('scan')
pages = paginator.paginate(TableName='testing-table', PaginationConfig={
    'MaxItems': 2, 'PageSize': 1})

with stubber:
    try:
        list(pages)
    except ClientError:
        print(pages.resume_token)

I'm leaning towards that second option. There is still the possibility that you will not be able to use that resume_token as the tokens for some services do expire. What do you think?

JordonPhillips avatar Sep 28 '16 20:09 JordonPhillips

Thanks for looking into this, and confirming resume_token is set only under certain conditions.

I understand the worry about negative performance impact, and I'm certainly not in a position to judge that better than you. The ability to grab the resume_token upon error may be a partial solution, but I don't think it will help my case (we need an emoji of a sad panda).

So, here's my use-case, for reference, even if it involves lots of AWS stuff beyond botocore: I have a serverless app in AWS (API Gateway into Lambdas written in Python, using DynamoDB for storage). I also have Lambdas kicking in on a cron-based CloudWatch Events schedule, say once a day, for certain tasks. One of them is to look at all the entities (in a certain DynamoDB table) matching a condition and - if they match another condition checked in code - delete them.

Easy, but the scheduled Lambda has to abide by the timeout limit (e.g. 5 minutes), so I was thinking of making it retrieve the records via paginator, process the Items in each page, and keep track of the resume_token as each page is processed. If the time remaining to the Lambda drops below a certain threshold (say 10 seconds), it would spawn a new instance of the Lambda (via async invoke), passing it the value of the last resume_token, and then let the first instance of the Lambda die.

If resume_token is not updated after each page is iterated, then I might have to use LastEvaluatedKey instead, but that puts more burden on our app: it must ensure we retrieve the records in some "monotonic" order based on the key, the keys may be more complex than a single value and differ from table to table, and the whole thing only works if the data I pass from one Lambda to the "next" is serializable/picklable in some form, etc.
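To make the handoff concrete, here's roughly the shape I have in mind. The function names, the threshold constant, and the payload format are all mine; `context` is the standard Lambda context object and `lambda_client` a boto3 Lambda client:

```python
import json

HANDOFF_THRESHOLD_MS = 10000  # hand off roughly 10s before the timeout


def should_hand_off(context):
    """True once the Lambda is close enough to its timeout that the
    remaining pages should go to a fresh invocation."""
    return context.get_remaining_time_in_millis() < HANDOFF_THRESHOLD_MS


def hand_off(lambda_client, function_name, last_evaluated_key):
    """Async-invoke a fresh copy of this function, passing the resume
    point in the payload. LastEvaluatedKey is a plain dict of DynamoDB
    attribute values, so it JSON-serializes without trouble."""
    lambda_client.invoke(
        FunctionName=function_name,
        InvocationType='Event',  # asynchronous invocation
        Payload=json.dumps({'ExclusiveStartKey': last_evaluated_key}),
    )
```

The new invocation would read ExclusiveStartKey out of its event and feed it straight back into the next Scan.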

FrancescoRizzi avatar Sep 28 '16 21:09 FrancescoRizzi

There are a couple additional "gotchas" that I ran into (unless I'm missing something else).

  • When using LastEvaluatedKey, you must ensure that PaginationConfig['MaxItems'] is evenly divisible by PaginationConfig['PageSize']. If it is not, then in the last retrieved page the Count/ScannedCount will not equal len(page['Items']), and page['LastEvaluatedKey'] will not be consistent with the final item in the Items list. It's not clear to me why (in the absence of ScanFilter) those would not be equal, but that is the case for the last page when MaxItems mod PageSize != 0. I ran into this because I was using a heuristic to figure out "about how many items I could process in time_interval", and I was missing data on the boundaries. It's not clear whether this is intentional, or whether resume_token would show the same behavior.
  • Using page['LastEvaluatedKey'] forces you to use the Client.scan() interface (as ExclusiveStartKey={'field': {'S': 'value'}}); i.e., you can't use the Paginator.Scan.paginate() interface, because there's nowhere to bootstrap paginate() with something like LastEvaluatedKey. That is, paginate() expects the paginator's resume_token, a string token, rather than a DDB attribute-value dict as returned by LastEvaluatedKey. As an aside, the docs don't even say that paginate()'s response includes LastEvaluatedKey, and as far as I can tell it is not useful there. An example is here: https://github.com/boto/boto3/issues/954

brianfaull avatar Apr 20 '17 06:04 brianfaull

Had the same issue as @FrancescoRizzi - the solution I found was using page['Marker'], which is simply the key of the last S3 object in the page.

You can also pass in any S3 object key as PaginationConfig={ 'StartingToken': 'path/to/last_handled_s3_object_key' }
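A small sketch of pulling the resume point out of a page, along the lines of this comment. The helper name is mine, not botocore's; note that with the v1 list_objects API, NextMarker is only present when a Delimiter was specified, so otherwise the last key in Contents is the natural resume marker:

```python
def s3_resume_point(page):
    """Return the key to feed back as StartingToken to resume listing.

    With the (v1) list_objects API, NextMarker only appears when a
    Delimiter was specified; otherwise the last key in Contents serves
    as the resume marker. Returns None for an empty page.
    """
    if 'NextMarker' in page:
        return page['NextMarker']
    contents = page.get('Contents', [])
    return contents[-1]['Key'] if contents else None
```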

budowski avatar Jan 27 '18 22:01 budowski

Setting the raw value to resume_value on each page iteration, and then making resume_token a property that serializes on access, would address the speed concern @JordonPhillips. The Lambda use case of time-limited processing is pretty common, and the existing behaviour of resume_token - not set until iteration is over - seems quite strange in the context of those use cases. I've done workarounds in the pagination class, but they're fairly invasive, and it feels like this should be part of the pagination API, IMO.
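A toy illustration of that split - this is not botocore's real PageIterator, just the shape of the proposal, with hypothetical names:

```python
import json


class LazyTokenPages:
    """Sketch: store raw pagination state cheaply on every page
    (resume_value), and pay the serialization cost only when
    resume_token is actually read."""

    def __init__(self, pages):
        self._pages = pages        # stand-in for service responses
        self.resume_value = None   # raw per-page state; cheap assignment

    def __iter__(self):
        for index, page in enumerate(self._pages):
            self.resume_value = {'page_index': index}
            yield page

    @property
    def resume_token(self):
        # Serialization happens here, on access, not once per page.
        if self.resume_value is None:
            return None
        return json.dumps(self.resume_value)
```

The per-page cost stays at one dict assignment; callers who never touch resume_token never pay for serialization.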

kapilt avatar Apr 24 '19 11:04 kapilt

Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. Because it has been longer than one year since the last update on this, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.

github-actions[bot] avatar Aug 24 '21 21:08 github-actions[bot]

Still an issue, sadly; I was just reviewing my workarounds for this.

FWIW, the diff is fairly minimal: it just delays yielding the response until after the paginator's instance variables are updated, instead of yielding eagerly.

49d48
<                 yield response
51a51
>                 log.info('next-token %s', next_token)
52a53
>                     yield response
58a60
>                     yield response
66a69
>                 yield response
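To illustrate the eager-vs-delayed distinction with a pair of made-up minimal generators (not botocore's actual code):

```python
def eager_pages(responses):
    """Yield each page *before* recording its continuation token (the
    behaviour being patched): if the consumer stops mid-iteration, the
    recorded token lags one page behind what it actually saw."""
    state = {'token': None}

    def gen():
        for resp in responses:
            yield resp
            state['token'] = resp.get('next_token')
    return state, gen()


def delayed_pages(responses):
    """Record the token first, then yield: the recorded token is always
    consistent with the last page the consumer received."""
    state = {'token': None}

    def gen():
        for resp in responses:
            state['token'] = resp.get('next_token')
            yield resp
    return state, gen()
```

With the delayed variant, a consumer that bails out early (a Lambda hitting its time limit, say) can read off a resume point that matches the last page it processed.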

kapilt avatar Oct 19 '21 23:10 kapilt