Retry search at current cursor on `total:None`
This is more of a workaround for an issue I've run into with search result paging. I'm not sure why the client is sporadically receiving an unexpected response -- it seems like a problem with the search API which might need investigation.
Getting the following error occasionally on `ia search` with largish result sets (e.g. 92669), part way through:
```
Traceback (most recent call last):
  File "/home/charles/.local/bin/ia", line 8, in <module>
    sys.exit(main())
  File "/home/charles/.local/lib/python3.8/site-packages/internetarchive/cli/ia.py", line 171, in main
    sys.exit(ia_module.main(argv, session))
  File "/home/charles/.local/lib/python3.8/site-packages/internetarchive/cli/ia_search.py", line 100, in main
    for result in search:
  File "/home/charles/.local/lib/python3.8/site-packages/internetarchive/search.py", line 279, in __next__
    return self.iterator.__next__()
  File "/home/charles/.local/lib/python3.8/site-packages/internetarchive/search.py", line 153, in _scrape
    num_found = int(j['total'])
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
```
Example API JSON response (from adding some debug code to capture the unexpected response):
```
UNEXPECTED RESPONSE: {'items': [], 'count': 0, 'previous': 'W3siaWRlbn....REDACTEDjustincase...X0tONjgzIn1d', 'total': None}
```
This was occurring sporadically for me over the weekend, at 10K multiples, on various queries ranging from 30K to 90K results.
The query I was using is admittedly a bit complex:
```
ia search "mediatype:texts AND boxid:IA* AND scribe3_search_catalog:isbn AND identifier:t*" -f "scribe3_search_id" -p scope:all
```
But I have in the past run 'worse' ones with more clauses, fields, and expected results without problems :)
I haven't tested yet whether the `scope:all` contributes to the buggy behaviour. I started filtering on the first letter of the identifier to reduce the size of the queries.
The code change here simply checks `total` before assuming it is an int, and retries the 10K-limit request at the current cursor if `total` is `None` -- which seems to be an unexpected result; the current code expects either `0` or a positive integer. This was enough to let me complete a number of large search tasks.
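For illustration only, here is a minimal standalone sketch of the same guard-and-retry idea against the public scrape endpoint (the actual change is a small guard inside the `_scrape()` generator in `internetarchive/search.py`; `scrape_all` and its parameters are hypothetical):

```python
import requests

# Public scrape API endpoint used by `ia search`.
SCRAPE_URL = "https://archive.org/services/search/v1/scrape"

def scrape_all(query, session=None):
    """Yield search result docs, retrying the current page when the API
    returns total=None rather than crashing in int(j['total'])."""
    session = session or requests.Session()
    params = {"q": query, "count": 10000}  # 10K is the per-request limit
    while True:
        r = session.get(SCRAPE_URL, params=params)
        r.raise_for_status()
        j = r.json()
        if j.get("total") is None:
            # Unexpected empty page: retry from the same cursor.
            continue
        yield from j.get("items", [])
        cursor = j.get("cursor")
        if not cursor:
            break  # no cursor means this was the last page
        params["cursor"] = cursor
```

A genuinely empty result set comes back with `total: 0`, which is falsy but not `None`, so it skips the retry branch and terminates normally.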
Possible issues:
- This could get stuck in a loop -- there's no give-up clause if the API decides to consistently return `total: None`.
- I only checked very briefly whether 0 search results avoid this retry code -- searching for nonexistent categories with 0 results does not get stuck in a loop, but there may be other cases that need checking.
- `total: None` looks like an API bug that this workaround would mask without the root cause being understood.
I'm not a huge fan of this band-aid fix, especially considering the possibility of the infinite `None` loop. But without taking a deeper look at the pagination code, I can't suggest a better solution either.
Thanks for the PR @hornc, and thanks for taking a look, maxz!
I think we should retry a few times when we run into `total: None`, and then throw a `ReadTimeout` exception like we do here if all retries fail/return `None` for `total`. Based on feedback from our search team, I think we should back off for 1 second on each retry. We should probably also expose these values (retry count, backoff delay) via optional parameters.
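For concreteness, a rough sketch of that approach (the `fetch_page` callable and the names `max_retries`/`backoff` are hypothetical stand-ins for the optional parameters mentioned above):

```python
import time
from requests.exceptions import ReadTimeout

def page_with_retries(fetch_page, max_retries=3, backoff=1.0):
    """Return a decoded JSON page with a usable 'total', retrying a few
    times with a fixed delay and giving up with ReadTimeout."""
    for attempt in range(max_retries + 1):
        j = fetch_page()
        if j.get("total") is not None:
            return j
        if attempt < max_retries:
            time.sleep(backoff)  # back off 1 second per retry
    raise ReadTimeout(
        f"scrape API kept returning total=None after {max_retries} retries")
```

This would also address the infinite-loop concern raised above, since exhausting the retries raises instead of spinning.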
@hornc I'm happy to make these changes if you'd like. I have a few other projects that need tending to, but hopefully I can get to it soon! Thanks again.
Thanks @jjjake , that sounds great!
@maxz, thanks for the review. I'll remove my comment line, but adding the backoff etc. and doing it properly will be the better solution. Happy to close this in favour of a complete solution from @jjjake.
@mekarpeles This was the behavior we were seeing on Friday.
First off, I want to +1 @jjjake's proposal of having some sort of retry with exponential backoff.
However, as it stands today, the existing code is worse than the PR @hornc offers, in that it throws exceptions upstream.
I would support merging @hornc's changes to unbreak code that relies on this functionality, and then also support an effort that adds retries + backoff and warnings.
This problem can still occur on v3.2.0 -- I had been using an older patched version with the fix from this PR daily (without exp. backoff, and without problems), but recently upgraded.
It's a sporadic issue, but I've seen it under v3.2.0. I've now upgraded to v3.5.0. If it occurs again, I'll rebase this PR and see about adding a backoff.