openverse-api
Dead link filtering reduces number of queryable pages
Description
The dead link filtering algorithm significantly limits the number of pages that can actually be retrieved.
Reproduction
- Go to the audio search endpoint: http://localhost:8000/v1/audio/. Note the number of pages.
- Go to a page number that's quite high but less than 250, like 200: http://localhost:8000/v1/audio/?page=200
- See the deep pagination error. This should not be happening. (A reproduction sketch follows below.)
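For illustration, a rough reproduction sketch using `requests` against a local instance (the `page_count` field name is taken from the API responses discussed later in this thread):

```python
import requests

# The first page reports a large number of available pages...
first = requests.get("http://localhost:8000/v1/audio/").json()
print(first["page_count"])

# ...yet requesting a high page number that is still under the stated
# 250 limit returns the deep pagination error instead of results.
resp = requests.get("http://localhost:8000/v1/audio/", params={"page": 200})
print(resp.status_code, resp.json())
```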
Expectation
Page count calculation should be correct and should allow pagination up to the stated limit.
Resolution
- [ ] 🙋 I would be interested in resolving this bug.
One way to fix this would be to check if result_count is divisible by page_size. If not, increment page_count by 1.
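As a minimal sketch of that suggestion (not the actual `search_controller.py` code), the adjustment amounts to a ceiling division:

```python
import math

def compute_page_count(result_count: int, page_size: int) -> int:
    """Pages needed to hold result_count results: the integer quotient,
    incremented by 1 when result_count is not divisible by page_size."""
    return math.ceil(result_count / page_size)

# 100 results at 20 per page -> 5 pages; 101 results -> 6 pages.
assert compute_page_count(100, 20) == 5
assert compute_page_count(101, 20) == 6
```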
Updated this issue to actually reflect the second half of the bug report, the first being resolved in #172.
Hi @dhruvkb, can you please provide more details about this bug?
A brief search shows that api/controllers/search_controller is the file where the bug is.
Methods
- _paginate_with_dead_link_mask
- _get_query_slice
Is this correct?
@ritesh-pandey I added the steps to reproduce. You're right about the bug being in the search_controller.py file.
The problem for audio is actually the fact that we only have one provider at the moment, and when we send 20 requests at the same time to check if the links are dead or not, a large portion of the responses return errors.
One [temporary] solution could be to suspend checking for dead links for audio, since we have ingested the audio data only recently, and they are most likely all current.
Do we experience this bug in non-audio search as well?
I agree with @obulat. The system starts deleting images from the cache when we go to high page numbers, which sometimes causes a zero result set.
This seems to be the same problem as #13, and the solution is not trivial. I'm going to remove the "good first issues" label.
One mitigation approach we could employ here (based on this discussion from the Make WP Slack):
For cases where the number of results on a page is less than the page size, and the page itself is less than the expected page count, adjust the result_count and page_count values to reflect the actual number of results returned. An example:
Let's say that we have 100 results in ES, but only 50 of those are live links. We could iterate through the first 3 pages, and then on page 3 (of a projected 5 total pages) we'd have 10 results instead of 20 which is the page size. It's at page 3 that we'd report the actual total result count and number of pages.
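A rough sketch of that adjustment, using hypothetical parameter names that are not taken from the actual controller:

```python
def adjust_counts(live_results_on_page: int, page: int, page_size: int,
                  result_count: int, page_count: int) -> tuple[int, int]:
    """When a page comes back short before the projected last page,
    report the results seen so far as the true totals."""
    if live_results_on_page < page_size and page < page_count:
        result_count = (page - 1) * page_size + live_results_on_page
        page_count = page
    return result_count, page_count

# The example above: page 3 of a projected 5, with 10 live results and a
# page size of 20, reports 50 results across 3 pages.
assert adjust_counts(10, 3, 20, 100, 5) == (50, 3)
```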
I know this is a bigger change than we can implement for the existing v1 API, but this kind of problem makes me wonder whether we would be better off returning a pagination cursor that is null when there are no more pages. The fact is that, due to dead link filtering, we have very limited ability to accurately report the number of pages, and it's not unreasonable for someone to write code like this (which is not reliable with our current API):
import requests

first_page = requests.get(".../images?q=birds").json()
page_count = first_page["page_count"]
i = 2
while i <= page_count:
    # request a new page, do whatever you want with the results
    i += 1
For any query that returns a page count I would personally fully expect it to be stable through the iterations of the pages. But we cannot guarantee that because the page count is based on the full number of not-yet-validated results. We currently calculate page_count based on the total number of hits that ES reports. It's not even possible for us to easily apply our dead link masks to this, at least not that I can tell: https://github.com/WordPress/openverse-api/blob/4edae23b8e12b8ceb19c75652b215621128fb4fa/api/catalog/api/controllers/search_controller.py#L512-L513
Maybe we could sum the dead link mask and subtract that from the total count, but unless a query has been paginated to the end of all the ES results, it won't matter. And we know that won't happen because some queries return more results than we have ever made possible to retrieve via the API due to ES deep pagination limits.
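For contrast, a hypothetical cursor-based flow (not part of the current v1 API; the `cursor` parameter, `next_cursor`, and `results` names are assumptions) would let clients iterate without trusting a page count at all:

```python
import requests

cursor = None
while True:
    params = {"q": "birds"}
    if cursor is not None:
        params["cursor"] = cursor  # assumed parameter name
    page = requests.get(".../images", params=params).json()
    # ... process page["results"] ...
    cursor = page.get("next_cursor")  # null/None when there are no more pages
    if cursor is None:
        break
```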
I've assigned myself the issue to work on the solution that Madison shared above, but I think as we look forward to a potential v2 of the API, it would be good to rethink how we present pagination entirely.
> adjust the result_count
@AetherUnbound Did you have a way of doing this in mind? I'm not sure how to do this on the last page of a query given that we've skipped over a certain number of results. Given that we already limit pagination far below the point where some queries could ever be exhausted (i.e., we have "inaccessible results" that aren't even dead links), I propose we leave result_count alone and only update page_count, setting it to the currently requested page when we detect that we've reached the last possible page for a query based on the number of results on the current page.
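A minimal sketch of that proposal, assuming hypothetical names rather than the actual search_controller internals:

```python
def cap_page_count(page: int, results_on_page: int,
                   page_size: int, page_count: int) -> int:
    """Leave result_count untouched; when the current page comes back short,
    treat it as the last reachable page and cap page_count there."""
    if results_on_page < page_size:
        return page
    return page_count

# A short page 7 (12 results at a page size of 20) caps a projected
# page_count of 30 down to 7.
assert cap_page_count(7, 12, 20, 30) == 7
```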
I really appreciate the alternative pagination solution you've offered @sarayourfriend! Thinking about the maximum amount of accurate information we could give, the pagination cursor seems like the best option. Many of the APIs we ingest behave this way, so it seems to be a common pattern at least.
I don't have any advice or thoughts on the specific implementation of how to adjust result_count; it was merely a proposition I was putting out! If updating the page_count seems the most feasible, let's go with that 🙂 I'm sure most of our API consumers are using that rather than the result count anyway, as you've mentioned.
Sounds good. Feel free to check out the PR when you have a chance if you're keen: #1032