
Dead link filtering reduces number of queryable pages

Open dhruvkb opened this issue 4 years ago • 7 comments

Description

The dead link filtering algorithm significantly limits the pages that can be retrieved.

Reproduction

  1. Go to the audio search endpoint: http://localhost:8000/v1/audio/ and note the number of pages.
  2. Go to a page number that's quite high but less than 250, like 200: http://localhost:8000/v1/audio/?page=200
  3. See the deep pagination error. This should not be happening.

Expectation

Page count calculation should be correct and should allow pagination up to the stated limit.

Resolution

  • [ ] 🙋 I would be interested in resolving this bug.

dhruvkb avatar Sep 01 '21 17:09 dhruvkb

One way to fix this would be to check if result_count is divisible by page_size. If not, increment page_count by 1.
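The suggestion above amounts to ceiling division. A minimal sketch, assuming `result_count` and `page_size` are the values used by the API's pagination code (names here are illustrative, not the actual implementation):

```python
def compute_page_count(result_count: int, page_size: int) -> int:
    # Integer division truncates, so a partial final page would be lost;
    # bump the count by one whenever there is a remainder.
    page_count = result_count // page_size
    if result_count % page_size != 0:
        page_count += 1
    return page_count

print(compute_page_count(100, 20))  # 5 (exact multiple, no extra page)
print(compute_page_count(101, 20))  # 6 (one leftover result adds a page)
```

The same result can be written as `-(-result_count // page_size)` or with `math.ceil`, but the explicit remainder check mirrors the proposal as stated.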

ritesh-pandey avatar Sep 05 '21 12:09 ritesh-pandey

Updated this issue to actually reflect the second half of the bug report, the first being resolved in #172.

dhruvkb avatar Sep 06 '21 17:09 dhruvkb

Hi @dhruvkb, can you please provide more details about this bug?

A brief search shows that api/controllers/search_controller.py is the file where the bug is.

Methods

  1. _paginate_with_dead_link_mask
  2. _get_query_slice

Is this correct?

ritesh-pandey avatar Sep 17 '21 07:09 ritesh-pandey

@ritesh-pandey I added the steps to reproduce. You're right about the bug being in the search_controller.py file.

dhruvkb avatar Sep 17 '21 08:09 dhruvkb

The problem for audio is actually the fact that we only have one provider at the moment, and when we send 20 requests at the same time to check if the links are dead or not, a large portion of the responses return errors.

One [temporary] solution could be to suspend checking for dead links for audio, since we have ingested the audio data only recently, and they are most likely all current.

obulat avatar Sep 17 '21 09:09 obulat

Do we experience this bug in non-audio search as well?

I agree with @obulat. The system starts deleting images from the cache when we go to high page numbers. This sometimes causes an empty result set.

ritesh-pandey avatar Sep 17 '21 10:09 ritesh-pandey

This seems to be the same problem as #13 and the solution is not trivial. I'm going to remove the "good first issue" label.

krysal avatar Oct 07 '21 14:10 krysal

One mitigation approach we could employ here (based on this discussion from the Make WP Slack):

For cases where the number of results on a page is less than the page size, and the page itself is less than the expected page count, adjust the result_count and page_count values to reflect the actual number of results returned. An example:

Let's say that we have 100 results in ES, but only 50 of those are live links. We could iterate through the first 3 pages, and then on page 3 (of a projected 5 total pages) we'd have 10 results instead of 20, which is the page size. It's at page 3 that we'd report the actual total result count and number of pages.
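The adjustment described above could be sketched roughly as follows. All names here are hypothetical; this is not the actual `search_controller` code, just an illustration of the short-page detection under the stated assumptions:

```python
def adjust_counts(page, page_size, results, result_count, page_count):
    """If a page comes back short before the projected last page,
    treat it as the final page and report the actual totals."""
    if len(results) < page_size and page < page_count:
        # Earlier pages were full, so the live total is what those
        # pages held plus this partial page.
        result_count = (page - 1) * page_size + len(results)
        page_count = page
    return result_count, page_count

# 100 hits in ES, page size 20, but page 3 yields only 10 live results:
print(adjust_counts(3, 20, ["r"] * 10, 100, 5))  # (50, 3)
```

One caveat, noted later in this thread: this assumes the earlier pages were completely full of live results, which dead-link filtering does not actually guarantee.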

AetherUnbound avatar Nov 30 '22 03:11 AetherUnbound

I know this is a bigger change than we can implement for the existing v1 API, but this kind of problem makes me wonder whether we would be better off returning a pagination cursor that is returned as null when there are no more pages. The fact is that due to dead link filtering we have very limited ability to accurately report the number of pages and it's not unreasonable for someone to write code like this (which is not reliable with our current API):

first_page = requests.get(".../images?q=birds").json()
page_count = first_page["page_count"]

i = 2
while i <= page_count:
  # request page i, do whatever you want with the results
  i += 1

For any query that returns a page count I would personally fully expect it to be stable through the iterations of the pages. But we cannot guarantee that because the page count is based on the full number of not-yet-validated results. We currently calculate page_count based on the total number of hits that ES reports. It's not even possible for us to easily apply our dead link masks to this, at least not that I can tell: https://github.com/WordPress/openverse-api/blob/4edae23b8e12b8ceb19c75652b215621128fb4fa/api/catalog/api/controllers/search_controller.py#L512-L513

Maybe we could sum the dead link mask and reduce that from the total count, but unless a query has been paginated to the end of all the ES results, it won't matter. And we know that won't happen because some queries will return more results than we have ever made possible to retrieve via the API due to ES deep pagination limits.

I've assigned myself the issue to work on the solution that Madison shared above, but I think as we look forward to a potentially v2 of the API, it would be good to rethink how we present pagination entirely.
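The cursor-based alternative floated above could look something like this. The `next_cursor` field and `fetch_page` callback are hypothetical names for illustration; the point is that the client loops until the cursor is null rather than trusting a precomputed `page_count`:

```python
def paginate(fetch_page, first_cursor=None):
    """Yield results until the backend stops returning a cursor."""
    cursor = first_cursor
    while True:
        results, cursor = fetch_page(cursor)
        yield from results
        if cursor is None:  # null cursor == no more pages
            break

# Fake backend: 45 items served 20 at a time, cursor is just an offset.
data = list(range(45))

def fetch_page(cursor):
    start = cursor or 0
    chunk = data[start:start + 20]
    next_cursor = start + 20 if start + 20 < len(data) else None
    return chunk, next_cursor

print(len(list(paginate(fetch_page))))  # 45
```

With this shape, dead-link filtering can shrink or pad any individual page without the client's loop ever going wrong, because termination depends only on the cursor, not on a count computed up front.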

sarayourfriend avatar Dec 06 '22 03:12 sarayourfriend

adjust the result_count

@AetherUnbound Did you have a way of doing this in mind? I'm not sure how to do this on the last page of a query given that we've skipped over a certain number of results. Given that we already cap pagination far below the point where some queries could ever be exhausted (i.e., we have "inaccessible results" that aren't even dead links), I propose we leave the result_count alone and only update the page_count, setting it to the currently requested page when we detect that we've reached the last possible page for a query based on the number of results on the current page.

sarayourfriend avatar Dec 06 '22 03:12 sarayourfriend

I really appreciate the alternative pagination solution you've offered @sarayourfriend! Thinking about what's the maximum amount of accurate information we could give, the pagination cursor seems like the best option. Many of the APIs we ingest behave this way, so it seems to be a common pattern at least.

I don't have any advice or thoughts on the specific implementation of how to adjust result_count; it was merely a proposition I was putting out! If updating the page_count seems the most feasible, let's go with that 🙂 I'm sure most of our API consumers are using that rather than the result count anyway, as you've mentioned.

AetherUnbound avatar Dec 12 '22 20:12 AetherUnbound

Sounds good. Feel free to check out the PR when you have a chance if you're keen: #1032

sarayourfriend avatar Dec 12 '22 21:12 sarayourfriend