courtlistener
courtlistener copied to clipboard
Use `search_after` pagination in the search API for deeper pagination
Working on changing the approach to pagination on the ES Search fronted I have encountered some challenges:
The problem with our current pagination approach, as noted in the documentation, is as follows:
Avoid using from and size to page too deeply or request too many results at once. Search requests usually span multiple shards. Each shard must load its requested hits and the hits for any previous pages into memory. For deep pages or large sets of results, these operations can significantly increase memory and CPU usage, resulting in degraded performance or node failures.
We are not doing pagination too deep because it's limited to up to 100 pages, so at most, users are allowed to request up to 2000 hits. However, this approach can still affect memory and CPU usage since ES has to load all the previous pages into memory for each user request.
This issue can be solved using the search_after
parameter for pagination.
The search_after
parameter is the sorting key of the last result on the previous page and it must be unique. The first challenge arises when sorting by relevance or dates, as we can get repeated values. I resolved this issue by adding a secondary sorting parameter, the document ID.
-
So the first problem is related to this new parameter. It's required to be passed in the GET request to access the next page, so we'll need to do something like:
&page=[0.0024906613, 200]
Here, the issue is that users won’t be able to control the page number by tweaking the parameters in the request; they'll have to use always the buttons. This change in the page parameter will also impact the Search API's pagination in documents that use ES.
-
The second issue is about going backwards.
search_after
is designed for forward navigation. Thus, there isn't an easy way to use it for going backward.For example, to go to a previous page, we'll need the sorting key from the last result on the page before the previous one. If a user is on page 4 and wants to go to page 3, we'll require the sorting key from the last result on page 2.
Storing these parameters for each user's search request doesn't seem like a good approach.
An alternative is to use traditional pagination based on the page number for going backward and
search_after
for going forward, assuming it's more common for users to navigate forward.However, using two different pagination approaches can lead to inconsistencies between backward and forward results if the index content is updated while the user is browsing the results.
This is a big tradeoff, and I guess it only affects pages beyond page one, right? If that's the case and if we mostly mitigate this by disallowing high pagination in the first place, maybe this isn't worth doing.
I could see it being an option for the API though, for example, if somebody wants to do deep pagination, they could do that. Or, I guess, if we wanted to enable deep pagination in the front end, maybe we could enable this starting on page 101?
Yes, this only directly affects requests beyond page one. However, indirectly, I think it can affect all requests. For instance, if 100 users are requesting page 50 at the same time, that means approximately 100,000 hits are stored in memory at that time, which can affect the overall performance of the cluster.
But if users typically don't go beyond a couple of pages, this doesn't seem like a significant issue.
I could see it being an option for the API though, for example, if somebody wants to do deep pagination, they could do that. Or, I guess, if we wanted to enable deep pagination in the front end, maybe we could enable this starting on page 101?
Exactly search_after
is the best option if we want to allow deep pagination.
Regarding backward pagination, one approach that could work is inverting the sorting order. For instance, when going forward, if the order is asc
, and considering each page has 5 documents:
sorting_key: 1,1
sorting_key:2,2
sorting_key:3,3
sorting_key:4,4
sorting_key:5,5 search_after = 5,6, will return 6,7,8,9
sorting_key:6,6
sorting_key:7,7
sorting_key:8,8
sorting_key:9,9
If user wants to go to a previous page, we could request the inverse order: desc
sorting_key:9,9
sorting_key:8,8
sorting_key:7,7
sorting_key:6,6
sorting_key:5,5 search_after = 5,5, will return 4,3,2,1
sorting_key:4,4
sorting_key:3,3
sorting_key:2,2
sorting_key:1,1
More testing is required, but it seems this could work.
So for now, should this wait until we need to enable deep pagination?
So for now, should this wait until we need to enable deep pagination?
Yeah, I expect it'll be of little benefit. Most searchers look at the first couple results, and if those results aren't good, they refine the query and run it again. It's pretty rare to do a query you're happy with and to then move past page two or three. It's more useful in the API, but we haven't heard too much about that so far. I'll rename this to apply only to the API, but I think we can set it aside as a "maybe someday" issue.
Let's evaluate if this should be part of the v4 API.
Regarding this one:
Pros:
- If we use search_after in the API, we could implement it alongside CursorPagination which aligns with https://github.com/freelawproject/courtlistener/issues/930
- We won't need to perform an additional count query as in V3 or on the frontend, since the count is not required for pagination.
- It offers better performance and solves the issue regarding deep pagination.
Cons:
- More complex to implement.
- Users cannot jump to a specific page.
- There can be some issues with sorting keys. So far, it seems that using two sorting keys, for instance, relevance and document ID, can solve the issue, but more testing is required.
Great, sounds like we should do it. The pros address real problems that we have in v3 and the cons seem workable (you're good at complexity, there's no good reason to jump to a specific page, and I'm optimistic sorting keys can be solved).
This one can also be closed.
Here are some insights into deep pagination performance for reference: https://github.com/freelawproject/courtlistener/issues/4032#issuecomment-2105283379