botocore
Paginator Resume Token always None?
Hi. I presume I'm misunderstanding something about the use of the paginators, because I couldn't find any pre-existing issue on this topic (only a closed PR #88 mentioning the resume_token).
So, I'm trying to use a paginator for a DynamoDB scan, and capturing (after I iterate over each page) the resume_token value (so that, in my 'real-world' scenario, I can use it if needed to start a new paged iteration).
Here's a snippet:
```python
paginator = ddb_client.get_paginator('scan')
# I won't bore you with the paginate_args, but they don't include StartingToken
page_iterator = paginator.paginate(**paginate_args)
print "Paginating over results of Scan:"
for (index, page) in enumerate(page_iterator):
    print "Page #{0!s}:".format(index)
    items = page['Items']
    print "\tNumber of Items: {0!s}".format(len(items))
    print "\tResume_Token: {0!s}".format(page_iterator.resume_token)
```
I'm seeing this loop over all pages available, but for each page it shows `Resume_Token: None`.
Let's say I would like to grab the token so that, under certain circumstances (like a timeout), I can start off a new scanning operation from the spot that this scan operation reached.
Is that not supported? Am I stuck using the LastEvaluatedKey value from the page object instead?
I'm using:
- Python 2.7
- Mac OS X El Capitan (10.11.4)
- botocore 1.4.49
- boto3 1.4.0
`resume_token` is currently only set once `MaxItems` has been reached:
```python
import boto3
from botocore.stub import Stubber

ddb = boto3.client('dynamodb')

# Setup an expected response
stubber = Stubber(ddb)
last_evaluated_key = {"data": {"S": "value"}}
service_response = {
    "ScannedCount": 1,
    "Count": 1,
    "Items": [last_evaluated_key],
    "LastEvaluatedKey": last_evaluated_key
}
stubber.add_response('scan', service_response)

# Scan for a single item, using MaxItems to make sure resume_token is set
paginator = ddb.get_paginator('scan')
pages = paginator.paginate(TableName='testing-table', PaginationConfig={
    'MaxItems': 1, 'PageSize': 1})
with stubber:
    for page in pages:
        pass

# Print paging arguments
print(page.get('LastEvaluatedKey'))
print(pages.resume_token)
```
For now your only option in that case is the `LastEvaluatedKey`, which is unfortunate.
We could set it every time, but I'm not comfortable doing that because the setter would negatively impact performance (possibly noticeably, especially in the dynamodb case).
We could also potentially set it only in error cases, so this would work:
```python
import boto3
from botocore.exceptions import ClientError
from botocore.stub import Stubber

ddb = boto3.client('dynamodb')

# Setup an expected response
stubber = Stubber(ddb)
last_evaluated_key = {"data": {"S": "value"}}
service_response = {
    "ScannedCount": 1,
    "Count": 1,
    "Items": [last_evaluated_key],
    "LastEvaluatedKey": last_evaluated_key
}
stubber.add_response('scan', service_response)

# Setup an error
stubber.add_client_error('scan')

# Scan for two items, where the second call will throw an exception
paginator = ddb.get_paginator('scan')
pages = paginator.paginate(TableName='testing-table', PaginationConfig={
    'MaxItems': 2, 'PageSize': 1})
with stubber:
    try:
        list(pages)
    except ClientError:
        print(pages.resume_token)
```
I'm leaning towards that second option. There is still the possibility that you will not be able to use that `resume_token`, as the tokens for some services do expire. What do you think?
Thanks for looking into this, and confirming that `resume_token` is set only under certain conditions.
I understand the worry about a negative performance impact, and I'm certainly not in a position to judge that better than you. The ability to grab the `resume_token` upon error may be a partial solution, but I don't think it will help my case (we need an emoji of a sad panda).
So, here's my use case, for reference, even if it involves lots of AWS stuff beyond botocore: I have a serverless app in AWS (API Gateway into Lambdas written in Python, using DynamoDB for storage). I also have Lambdas kicking in on a cron-based CloudWatch Events schedule, say once a day, for certain tasks. One of them is to look at all the entities (in a certain DynamoDB table) matching a condition and, if they match another condition checked in code, delete them.
Easy, but the scheduled Lambda has to abide by the timeout limit (e.g. 5 minutes), so I was thinking of making it retrieve the records via a paginator, process the `Items` in each page, and keep track of the `resume_token` as each page is processed. If the time remaining to the Lambda is less than a certain threshold (say 10 seconds), spawn off a new instance of the Lambda (via async invoke), passing it the value of the last `resume_token` (and then let the first instance of the Lambda die).
If `resume_token` is not updated after each page is iterated, then I might have to use the `LastEvaluatedKey` instead, but that puts more burden on our app: it must ensure we retrieve the records in some "monotonic" order based on the key; the keys may be more complex than a single value, and different from table to table; and the whole thing works only if the data I pass from one Lambda to the next is serializable/picklable in some form, etc.
There are a couple of additional "gotchas" that I ran into (unless I'm missing something else).
- When using `LastEvaluatedKey`, you must ensure that `PaginationConfig['MaxItems']` is evenly divisible by `PaginationConfig['PageSize']`. If it is not, in the last retrieved page the `Count`/`ScannedCount` will not be equal to `len(page['Items'])`, and `page['LastEvaluatedKey']` will not be consistent with the final item in the `Items` list. It's not clear to me why (in the absence of `ScanFilter`) those would not be equal, but that is the case for the last page when `MaxItems % PageSize != 0`. I ran into this because I was using a heuristic to figure out about how many items I could process in a time interval, and I was missing data on the boundaries. It's not clear if this is intentional, and it's not clear whether `resume_token` would have the same behavior.
- Using `page['LastEvaluatedKey']` forces you to use the `Client.scan()` interface (as `ExclusiveStartKey={'field': {'S': 'value'}}`); i.e., you can't use the `Paginator.Scan.paginate()` interface, because there's nowhere to bootstrap a `paginate()` with something like `LastEvaluatedKey`. That is, `paginate()` expects the `paginator.resume_token`, a string token rather than a DDB attribute-value dict as returned by `LastEvaluatedKey`. As an aside, the docs don't even say that `paginate()`'s response includes `LastEvaluatedKey`, and as far as I can tell it is not useful. An example is here: https://github.com/boto/boto3/issues/954
Had the same issue as @FrancescoRizzi - the solution I found was using `page['Marker']`, which is simply the key of the last S3 object in the page. You can also pass in any S3 object key via `PaginationConfig={'StartingToken': 'path_to\last_handled_s3_object_key'}`.
Setting the raw value per page iteration to `resume_value`, and then making `resume_token` a property that serializes on access, would address the speed concern @JordonPhillips. The Lambda use case of time-limited processing is pretty common, and the existing behavior of `resume_token` (not set until iteration is over) seems quite strange in the context of those use cases. I've done workarounds in the pagination class, but it's fairly invasive, and it feels like it should be part of the pagination API imo.
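One possible userland workaround, rather than patching the pagination class: wrap the page iterator and record each page's `LastEvaluatedKey` as it is handed to the caller. The class and attribute names here are made up for illustration, and the demo uses hand-built fake pages instead of a real paginator:

```python
class ResumablePages:
    """Hypothetical wrapper: tracks a resume point per consumed page."""

    def __init__(self, page_iterator, token_key='LastEvaluatedKey'):
        self._pages = page_iterator
        self._token_key = token_key
        self.resume_point = None  # None until a page has been seen

    def __iter__(self):
        for page in self._pages:
            # Record the token before handing the page over, so a consumer
            # that breaks out mid-iteration still knows where to resume.
            self.resume_point = page.get(self._token_key)
            yield page


# Demo with fake pages; the last page carries no LastEvaluatedKey
fake_pages = [
    {'Items': [1], 'LastEvaluatedKey': {'pk': {'S': 'a'}}},
    {'Items': [2], 'LastEvaluatedKey': {'pk': {'S': 'b'}}},
    {'Items': [3]},
]
pages = ResumablePages(iter(fake_pages))
for page in pages:
    if page['Items'] == [2]:
        break  # simulate running out of time after the second page
```

After the break, `pages.resume_point` holds the second page's key, ready to hand to the next invocation.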
Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. Because it has been longer than one year since the last update on this, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.
still an issue sadly, was just reviewing my workarounds on this.
fwiw, the diff is fairly minimal: just change where the response is yielded, delaying it until after the paginator's instance variables are updated instead of yielding eagerly.
```diff
49d48
< yield response
51a51
> log.info('next-token %s', next_token)
52a53
> yield response
58a60
> yield response
66a69
> yield response
```