courtlistener
3033 Introduced V4 RECAP Search API
This PR introduces version 4 of the RECAP Search API as outlined in #3033, which will serve as the base for other version 4 API search types.
Here are the main features of this endpoint:
Search types:
The v4 RECAP Search API supports three different search types. All of them operate similarly in terms of queries: they support the same queries as the frontend, but they differ in how results are displayed. For example, the Docket type "d" can match dockets by RD fields, although RD fields are not displayed. The RECAP_DOCUMENT type "rd" can match RDs by docket fields that are indexed within each RD.
RECAP "r"
This is the main type and it mimics the RECAP Search results in the frontend. The object structure looks like this:
{
"assignedTo": null,
"assigned_to_id": null,
"attorney": [],
"attorney_id": [],
"caseName": "Bear River Band of Rohnerville Rancheria v. California Department of Social Services",
"case_name_full": "",
"cause": "42:1983 Civil Rights Act",
"chapter": null,
"court": "District Court, S.D. New York",
"court_citation_string": "S.D.N.Y.",
"court_exact": "nysd",
"court_id": "nysd",
"dateArgued": null,
"dateFiled": null,
"dateTerminated": null,
"date_created": "2024-04-03T19:37:37.412570Z",
"docketNumber": "4:23-cv-01809",
"docket_absolute_url": "/docket/181/bear-river-band-of-rohnerville-rancheria-v-california-department-of-social/",
"docket_id": 181,
"firm": [],
"firm_id": [],
"jurisdictionType": "",
"juryDemand": "",
"more_docs": false,
"pacer_case_id": 411140,
"party": [],
"party_id": [],
"recap_documents": [
{
"absolute_url": "/docket/181/54/bear-river-band-of-rohnerville-rancheria-v-california-department-of-social/",
"attachment_number": null,
"cites": [],
"description": "ORDER by Judge Haywood S. Gilliam, Jr. GRANTING 53 Stipulation to Extend Plaintiffs' Deadline to File Second Amended Complaint. Amended Pleadings due by 4/15/2024. (ndr, COURT STAFF) (Filed on 4/1/2024)",
"docket_entry_id": 15081,
"document_number": 54,
"document_type": "PACER Document",
"entry_date_filed": "2024-04-01",
"entry_number": 1,
"filepath_local": null,
"id": 15082,
"is_available": false,
"pacer_doc_id": "035024238118",
"page_count": null,
"short_description": "",
"snippet": "",
"timestamp": "2024-04-11T07:22:22.188855Z"
}
],
"referredTo": null,
"referred_to_id": null,
"suitNature": "Civil Rights: Other",
"timestamp": "2024-04-11T07:22:22.144353Z",
"trustee_str": null
}
- The outer level displays the Docket fields.
- The recap_documents field displays up to 5 RECAPDocuments that matched the query.
- If a result contains more than 5 matched RDs, more_docs will be true, indicating there are additional RDs matched by the query. Otherwise, it is false.
- Date values are shown as date objects without a timezone.
- Datetime values are displayed as ISO-8601 datetimes in UTC.
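A minimal sketch of how a client might read one "r" result, assuming the response item has already been parsed into a Python dict shaped like the example above. The helper name summarize_result is hypothetical and not part of this PR.

```python
def summarize_result(result: dict) -> str:
    """Build a one-line summary of an "r" result (a docket with nested RDs)."""
    docs = result.get("recap_documents", [])
    summary = (
        f'{result["caseName"]} ({result["docketNumber"]}): '
        f"{len(docs)} matched document(s) shown"
    )
    # more_docs is true when more than 5 RDs matched the query.
    if result.get("more_docs"):
        summary += ", more available"
    return summary


# Trimmed-down version of the example object above.
example = {
    "caseName": "Bear River Band of Rohnerville Rancheria v. California Department of Social Services",
    "docketNumber": "4:23-cv-01809",
    "more_docs": False,
    "recap_documents": [{"id": 15082, "document_number": 54}],
}
print(summarize_result(example))
```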
DOCKETS "d"
This search type only displays docket fields, without the recap_documents or more_docs fields.
{
"assignedTo": null,
"assigned_to_id": null,
"attorney": [],
"attorney_id": [],
"caseName": "Harris v. Broomfield",
"case_name_full": "",
"cause": "42:1983 Prisoner Civil Rights",
"chapter": null,
"court": "District Court, N.D. California",
"court_citation_string": "N.D. Cal.",
"court_exact": "cand",
"court_id": "cand",
"dateArgued": null,
"dateFiled": null,
"dateTerminated": null,
"date_created": "2024-04-12T17:18:01.859134Z",
"docketNumber": "4:21-cv-00283",
"docket_absolute_url": "/docket/195/harris-v-broomfield/",
"docket_id": 195,
"firm": [],
"firm_id": [],
"jurisdictionType": "",
"juryDemand": "",
"pacer_case_id": 371855,
"party": [],
"party_id": [],
"referredTo": null,
"referred_to_id": null,
"suitNature": "Prisoner: Prison Condition",
"timestamp": "2024-04-12T17:18:01.909951Z",
"trustee_str": null
}
Regarding the d and r types, I noticed that the party and attorney fields can be massive in some cases, which can make some responses quite large.
Should we consider displaying a maximum number of parties and attorneys?
RECAP_DOCUMENT "rd"
This search type only displays RECAPDocument fields.
{
"absolute_url":"/docket/180/275/rice-v-city-and-county-of-san-francisco/",
"attachment_number":null,
"cites":[
],
"description":" Order by Judge Laurel Beeler regarding 233 Bill of Costs. In the attached order, the court taxes the full amount of claimed costs ($19,469.61). (lblc1, COURT STAFF) (Filed on 3/31/2024)Any non-CM/ECF Participants have been served by First Class Mail to the addresses of record listed on the Notice of Electronic Filing (NEF)",
"docket_entry_id":15080,
"document_number":275,
"document_type":"PACER Document",
"entry_date_filed":"2024-03-31",
"entry_number":1,
"filepath_local":"recap/dev.gov.uscourts.cand.345347/gov.uscourts.cand.345347.275.0.pdf",
"id":15081,
"is_available":true,
"pacer_doc_id":"035024237367",
"page_count":6,
"short_description":"",
"snippet":" Case 3:19-cv-04250-LB Document 275 Filed 03/31/24 Page 1 of 6\n\n\n\n\n 1\n\n 2\n\n 3\n\n 4\n\n 5\n\n 6\n\n 7\n\n 8 UNITED STATES DISTRICT COURT\n\n ",
"timestamp":"2024-04-11T07:22:22.123649Z"
}
I realized that this search type can achieve the same results as the docket_id query in the frontend. For instance, a query like this:
?order_by=score+desc&type=rd&q=docket_id:1
will return all RDs that belong to that docket_id. So I did not add the docket_id query to the r type, since that would increase the number of nested documents from 5 to 100. Maybe the rd type is just better for the same objective?
One question here: would it be necessary to add some Docket fields to this serializer so users can easily identify the parent docket? Perhaps adding the docket_id?
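As a rough illustration, the query string above could be built like this. The base URL is the local development one that appears in the pagination links in this description, and the function name build_rd_query_url is made up for the example.

```python
from urllib.parse import urlencode

# Local development endpoint, as seen in the next/previous links in this PR.
BASE_URL = "http://localhost:8000/api/rest/v4/search/"


def build_rd_query_url(docket_id: int) -> str:
    """Return a search URL listing the RDs that belong to one docket."""
    params = {
        "order_by": "score desc",
        "type": "rd",
        "q": f"docket_id:{docket_id}",
    }
    # safe=":" keeps the fielded query readable instead of encoding ":" as %3A.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(build_rd_query_url(1))
```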
Results Count
Since the total count of results is no longer required for computing pagination, an additional count query is not required. Instead, the count is taken from the main query. With the count limit of 10,000, we can show the actual number of results if they're less than 10,000.
So the count key looks like this:
"count": {
"exact": 144,
"more": false
}
exact is the number of results matched by the query, up to 10,000 items.
more is true if there are more than 10,000 matching documents.
Suggestions are welcome if you have a better way to display the exact count and indicate that there are more results.
Cursor pagination and sorting
As requested in #3645, V4 now uses cursor pagination. The cursor paginator is custom-made to work alongside the ES search_after parameter, which improves performance during deep pagination. However, the architecture of our ESCursorPagination class follows the conventions of DRF's CursorPagination.
For the cursor paginator to function, it is mandatory to set a sorting key used as the "cursor" in the ES request. The supported sortings are:
"score desc" # Default
"dateFiled desc"
"dateFiled asc"
"entry_date_filed asc"
"entry_date_filed desc"
Additionally, to avoid pagination inconsistencies due to repeated values like scores or dates, a secondary sorting key, which must be unique for each result, is required. For the 'r' and 'd' types, this key is docket_id, and for the 'rd' type, it is the RD id. Note that using two sorting keys can lead to discrepancies between sorting in the frontend and in the API v4, even though the primary sorting field remains the same. The secondary sorting key orders results that share the same primary sorting value by docket_id desc or id desc, acting as a tiebreaker to ensure consistent results across pages. In the frontend, documents with the same sorting values are displayed in arbitrary order.
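To illustrate the tiebreaker, here is a small in-memory sketch (not the ES query itself) that sorts by score desc and then docket_id desc. The sample scores and ids are invented for the example.

```python
results = [
    {"docket_id": 181, "score": 2.2},
    {"docket_id": 195, "score": 2.2},  # same score as docket 181
    {"docket_id": 166, "score": 3.1},
]

# Primary key: score desc. Secondary key: docket_id desc, which is unique
# per result and therefore makes the order of tied scores deterministic.
ordered = sorted(
    results, key=lambda r: (r["score"], r["docket_id"]), reverse=True
)
ordered_ids = [r["docket_id"] for r in ordered]
print(ordered_ids)
```

Without the secondary key, the two results with a score of 2.2 could swap places between requests, so one of them might show up on two pages or on none.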
For date sorting such as 'dateFiled', a workaround was necessary regarding sorting and the search_after request. The issue arises because Dockets with a None 'dateFiled' are indexed as null in ES. When this field is used for sorting, the null values are represented as -9,223,372,036,854,775,808, which is Long.MIN_VALUE in Java, causing an illegal value error when sent as part of the search_after parameter.
The workaround applied was to use the function score (with a few tweaks) that we are currently using for the entry_date_filed sorting in the frontend. Thus, when a document with a 'dateFiled' of None is in the results, its sorting value is 0 instead of Long.MIN_VALUE.
So when sorting by either:
"dateFiled desc"
"dateFiled asc"
"entry_date_filed asc"
"entry_date_filed desc"
Results where the sorting field is None will be shown at the end, regardless of the order (desc or asc).
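The resulting order can be sketched in plain Python. This only reproduces the ordering described above (None values last in both directions); it does not reproduce the ES function score, and the helper name order_by_date and the sample data are assumptions for illustration.

```python
from datetime import date

# The sentinel ES uses for null sort values, i.e. Java's Long.MIN_VALUE.
JAVA_LONG_MIN = -(2**63)  # -9,223,372,036,854,775,808


def order_by_date(docs: list, descending: bool) -> list:
    """Sort by dateFiled, always placing docs without a date at the end."""
    dated = [d for d in docs if d["dateFiled"] is not None]
    undated = [d for d in docs if d["dateFiled"] is None]
    dated.sort(key=lambda d: d["dateFiled"], reverse=descending)
    return dated + undated


docs = [
    {"docket_id": 1, "dateFiled": date(2024, 4, 1)},
    {"docket_id": 2, "dateFiled": None},
    {"docket_id": 3, "dateFiled": date(2023, 1, 15)},
]
desc_ids = [d["docket_id"] for d in order_by_date(docs, descending=True)]
asc_ids = [d["docket_id"] for d in order_by_date(docs, descending=False)]
print(desc_ids, asc_ids)
```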
By default, ES does not provide a "search_before" parameter for backward pagination, so a different approach was required to implement backward cursor pagination. It uses the same search_after approach, but when going backwards, the sorting keys are inverted, and the item selected as the "search_after" is the first item on the page, which allows going back to the previous page. A final step is required: since the results of this backward query are returned in the reverse order of the original one, the results on that page must be inverted to restore the original order.
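The backward pagination idea can be simulated in memory. This is only a sketch of the principle under simplified assumptions: documents are (score, docket_id) tuples, the page size is 2, and search_after is emulated with a list filter rather than a real ES request. All names here are hypothetical.

```python
PAGE_SIZE = 2

# (score, docket_id) pairs; the API order is score desc, docket_id desc.
DOCS = [(3.0, 5), (2.0, 4), (2.0, 3), (1.0, 2), (1.0, 1)]


def search_after(docs, after, reverse):
    """Emulate search_after: sort, then keep only items past the cursor."""
    # Forward pagination sorts desc; backward pagination inverts the sort.
    ordered = sorted(docs, reverse=not reverse)
    if after is not None:
        if reverse:
            ordered = [d for d in ordered if d > after]
        else:
            ordered = [d for d in ordered if d < after]
    return ordered[:PAGE_SIZE]


def get_page(after=None, reverse=False):
    page = search_after(DOCS, after, reverse)
    if reverse:
        # Backward results arrive in inverted order; flip them back.
        page.reverse()
    return page


page_1 = get_page()
page_2 = get_page(after=page_1[-1])  # forward: cursor is the last item
previous = get_page(after=page_2[0], reverse=True)  # backward: first item
print(page_1, page_2, previous)
```

Going back from page 2 lands on the same items, in the same order, as page 1.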
Results per page are controlled by the SEARCH_API_PAGE_SIZE setting, which defaults to 20.
random sorting
The random sorting key is omitted from this PR. It currently uses a sorting script instead of a function score, so it would require a change in approach to work alongside the search_after parameter, and it might still not function properly because the sort values sent in search_after would themselves be random. Therefore, as we agreed, it will not be implemented at this time.
In general, cursor pagination is working as expected, and results are consistent across pages, even in scenarios where new items are indexed or removed. However, there are some corner cases where cursor pagination can lead to inconsistent results. For instance, the first or last document on the current page, which is used as the cursor, may be updated before moving to the next page in a way that changes the field used as the sorting key (like 'dateFiled') or, when sorting by relevance, the document score. In that case, the updated document or documents can be displayed again when moving to the next or previous page. To solve this issue, the ES documentation recommends using a point in time (PIT). I'll open an issue to describe how it works in detail in case we need to implement it in the future.
The cursor is a base-64 encoded string that looks like:
cursor=cz0yLjIwNzg0ODUmcz0xNjYmdD1k
It contains the following parameters:
- search_after: The ES search_after parameter.
- reverse: True if performing backward pagination, False if going forward.
- search_type: The search type to which the cursor belongs.
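For debugging, the example cursors in this description can be decoded with the standard library. The short key names (s, r, t) are simply what these example strings decode to; they are not a documented contract, so clients should treat cursors as opaque. The helper name decode_cursor is hypothetical.

```python
import base64
from urllib.parse import parse_qs, unquote


def decode_cursor(cursor: str) -> dict:
    """Decode a cursor into its query-string parameters (for inspection only)."""
    # Cursors taken from a URL may have their "=" padding percent-encoded.
    raw = base64.b64decode(unquote(cursor)).decode("ascii")
    return parse_qs(raw)


forward = decode_cursor("cz0yLjIwNzg0ODUmcz0xNjYmdD1k")
# Cursor taken from the "previous" link example in this description.
backward = decode_cursor("cz0yLjIwNzg0ODUmcz0xNjUmcj0xJnQ9ZA%3D%3D")
print(forward)   # {'s': ['2.2078485', '166'], 't': ['d']}
print(backward)  # {'s': ['2.2078485', '165'], 'r': ['1'], 't': ['d']}
```

The two s values line up with the two sorting keys (the score and the docket_id tiebreaker), r marks a backward cursor, and t carries the search type.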
If an invalid cursor string is sent, the response will contain the following body:
{
"detail": "Invalid cursor"
}
Also, if a user switches to a different search type (for instance, the original request was performed for the "r" type and is then changed to "d") without clearing the current cursor, the Invalid cursor error will be shown. This avoids pagination inconsistencies caused by the current cursor not matching the sorting values of the new search type.
On every ES request, 'page_size + 1' documents are requested to check whether there are more results and to determine whether to display a next or previous page.
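A minimal sketch of the 'page_size + 1' check, assuming the hits are already available as a list. The helper name split_page is made up for illustration.

```python
PAGE_SIZE = 20  # mirrors the SEARCH_API_PAGE_SIZE default


def split_page(hits: list, page_size: int = PAGE_SIZE) -> tuple:
    """Return the hits to display and whether another page exists."""
    # One extra hit was requested; if it came back, there is another page.
    has_more = len(hits) > page_size
    return hits[:page_size], has_more


page, has_more = split_page(list(range(21)))
print(len(page), has_more)
```

The extra document is never shown to the user. It only signals that a next (or previous) link should be rendered.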
The next and previous links look like these:
"next": "http://localhost:8000/api/rest/v4/search/?cursor=cz0yLjIwNzg0ODUmcz0xNDYmdD1k&order_by=score+desc&type=d",
"previous": "http://localhost:8000/api/rest/v4/search/?cursor=cz0yLjIwNzg0ODUmcz0xNjUmcj0xJnQ9ZA%3D%3D&order_by=score+desc&type=d",
Highlighting on demand.
By default, highlighting in the v4 Search API is disabled, providing a performance boost to the requests. When highlighting is disabled in the RECAP results, the plain text snippet (first 500 characters) is extracted directly from the database for the results on a page. This is because highlighting is required to retrieve the 'no_match' fragment and to avoid retrieving the entire plain text, which can be expensive. Thus, to fully benefit from disabling highlighting, data extraction from the database is necessary.
To enable highlighting, users should pass the highlight=on parameter in the request.
Highlighted fields include:
Dockets
"assignedTo",
"caseName",
"cause",
"court_citation_string",
"docketNumber",
"juryDemand",
"referredTo",
"suitNature",
RDs
"short_description"
"description"
"plain_text"
Let me know what you think.
HERE WE GOOOO!
Should we consider displaying a maximum number of parties and attorneys?
I don't think so. If the backend can handle it then the consumer can too.
maybe the rd type is just better for the same objective?
Yes. Seems fine.
One question here, would it be necessary to add some Docket fields to this serializer so users can easily identify the parent docket? Perhaps adding the docket_id?
Yeah, that makes sense. Do we have the docket_entry_id in the recap_docket object in ES also?
Suggestions are welcome if you have a better way to display the exact count and indicate that there are more results.
Hm, we can do fast count queries that lose accuracy after a certain point, right? I'd say we do that instead. We can just provide the approximate count, and then document that it's only accurate for result sets smaller than XXX (whatever it was we discussed before, if we made a decision about it).
using two sorting keys can lead to discrepancies between sorting in the frontend and in the API v4
That's not a big deal, but we should open an issue to have it on our backlog as a "someday" issue. Seems easy to fix on the front end, right?
Highlighted fields include...
How does this compare to the front end?
Let me know what you think.
This all sounds good. My one concern is that the backwards pagination sounds like a pretty big hack. Is it something of a best practice or something you came up with to solve the problem?
I just did a very quick skim of the code. @ERosendo, if you can do a full review, that would be great, and I'll do a longer review after that.
Yeah, that makes sense. Do we have the docket_entry_id in the recap_docket object in ES also?
Correct, I'll add the docket_id to the response. And yeah, the docket_entry_id is already in the rd response.
Hm, we can do fast count queries that lose accuracy after a certain point, right? I'd say we do that instead. We can just provide the approximate count, and then document that it's only accurate for result sets smaller than XXX (whatever it was we discussed before, if we made a decision about it).
Sure, so that means we'd only provide the "count" key containing the accurate result if it's below a threshold, or the approximate count if it's greater.
The threshold we defined in #3926 was 2,000 documents, considering up to 100 pages of 20 documents each.
We can perform the same cardinality count with accurate results up to that threshold. However, since we're currently getting an accurate count of up to 10,000 items from the main query, we can use that count if the results are fewer than 10,000 and use the approximate count returned by the cardinality aggregation if they exceed 10,000. And the "more" key won't be shown in any case.
Does that sound good?
That's not a big deal, but we should open an issue to have it on our backlog as a "someday" issue. Seems easy to fix on the front end, right?
Sure, here is the issue: https://github.com/freelawproject/courtlistener/issues/3999
How does this compare to the front end?
The HL fields in the front end are the same as in the "r" search type (docket + RD fields) when highlight is enabled. The "d" type only highlights docket fields, and the "rd" type only highlights RD fields.
My one concern is that the backward pagination sounds like a pretty big hack. Is it something of a best practice or something you came up with to solve the problem?
Initially, it was just a brief idea I came up with in https://github.com/freelawproject/courtlistener/issues/3645#issuecomment-1904684499 when assessing the search_after parameter, but I was not sure it would work. The problem that remained unsolved was that the results were inverted on every page, so we'd need to invert them again on every page when going backward. Upon inspecting the CursorPagination class in DRF, I noticed that it follows the same approach: it inverts the sorting when going backward and also inverts the order of every page before returning the results to the user. So I felt more confident implementing the solution, as it employs the same principles, albeit with variations for SQL.
Everything above sounds great, thanks Alberto. My last comment is in response to this:
The "d" type only highlights docket fields, and the "rd" type only highlights RD fields.
That sounds right, but to be certain I understand correctly, are there fields that are shown on both the RD and the R search type that are not highlighted in RD even though they are highlighted in the R search type (or the same question for the D result type)?
That sounds right, but to be certain I understand correctly, are there fields that are shown on both the RD and the R search type that are not highlighted in RD even though they are highlighted in the R search type (or the same question for the D result type)?
All the fields highlighted in the r type are also highlighted in the rd and d types, according to the fields available in each type. For instance, the RECAPDocument fields are not available in the d type and the Docket fields are not available in the rd type.
Let me explain it with a response example for each type. The HL fields in each search type will look like this; note that I'm omitting all the non-HL fields for simplicity.
"r" type:
{
"assignedTo":"<mark>Lorem</mark> Ipsum",
"caseName":"Bear <mark>Lorem</mark> River Band of Rohnerville Rancheria v. California Department of Social Services",
"cause":"42:1983 Civil Rights Act <mark>Lorem</mark>",
"court_citation_string":"S.D.N.Y. <mark>Lorem</mark>",
"juryDemand":"<mark>Lorem</mark>",
"referredTo":"<mark>Lorem</mark>",
"suitNature":"Civil Rights: <mark>Lorem</mark>",
"recap_documents":[
{
"description":"ORDER <mark>Lorem</mark> by Judge Haywood S. Gilliam, Jr. GRANTING 53 Stipulation to Extend Plaintiffs",
"short_description":"<mark>Lorem</mark> Ipsum",
"snippet":"<mark>Lorem</mark> Ipsum"
}
],
"more_no_hl_fields"
}
"d" type:
{
"assignedTo":"<mark>Lorem</mark> Ipsum",
"caseName":"Bear <mark>Lorem</mark> River Band of Rohnerville Rancheria v. California Department of Social Services",
"cause":"42:1983 Civil Rights Act <mark>Lorem</mark>",
"court_citation_string":"S.D.N.Y. <mark>Lorem</mark>",
"juryDemand":"<mark>Lorem</mark>",
"referredTo":"<mark>Lorem</mark>",
"suitNature":"Civil Rights: <mark>Lorem</mark>",
"more_no_hl_fields"
}
"rd" type:
{
"description":"ORDER <mark>Lorem</mark> by Judge Haywood S. Gilliam, Jr. GRANTING 53 Stipulation to Extend Plaintiffs",
"short_description":"<mark>Lorem</mark> Ipsum",
"snippet":"<mark>Lorem</mark> Ipsum",
"more_no_hl_fields"
}
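For consumers that want plain text from a highlighted response, a small helper along these lines could strip the tags. It assumes highlights are always wrapped in plain mark tags without attributes, as in the examples above. The name strip_marks is hypothetical.

```python
import re

# Matches the opening and closing highlight tags shown in the examples.
MARK_TAG = re.compile(r"</?mark>")


def strip_marks(value: str) -> str:
    """Remove highlight tags, leaving the matched text in place."""
    return MARK_TAG.sub("", value)


print(strip_marks("Civil Rights: <mark>Lorem</mark>"))
```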
Perfect, that's what I was expecting, but just wanted to be sure! Thank you!
Great, I've added the docket_id field as part of the response for the "rd" search type.
I've also changed the count key to display the exact number of results if there are 10,000 hits or fewer, and an approximate count if there are more than 10,000 hits. This is done using a cardinality query based on an aggregation of the docket_id for the r and d types (which display dockets) and the id (RD pk) for the rd type (which displays RDs).
The main query and the cardinality query are performed in the same request using the ES Multi-search API.
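The selection between the two counts might look roughly like the sketch below. It assumes the main query's total is available in the shape ES uses for hits.total, a dict with value and relation keys where relation is "gte" once the tracked limit is exceeded. Whether the PR detects the limit this way is an assumption, and result_count is a made-up name.

```python
def result_count(main_total: dict, cardinality_estimate: int) -> int:
    """Pick the exact count when ES has one, else the cardinality estimate."""
    # relation == "eq" means the value is the exact number of hits.
    if main_total["relation"] == "eq":
        return main_total["value"]
    # relation == "gte" means the hit-tracking limit was reached, so fall
    # back to the approximate count from the cardinality aggregation.
    return cardinality_estimate


print(result_count({"value": 144, "relation": "eq"}, 144))
print(result_count({"value": 10000, "relation": "gte"}, 25310))
```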
Additionally, I've added error handling for the ES request related to the Search API. I introduced two custom errors:
ElasticServerError: For handling TransportError, ConnectionError, and RequestError.
ElasticBadRequestError: For handling query-parsing ApiError.
Any other ApiError not related to parsing also raises ElasticServerError.
Let me know what you think.
I think the next step, while this is being merged, is to start working on the V4 Search API documentation, correct? Or V4 API Opinions Search?
Thanks Alberto. All sounds good, and we await Eduardo's review.
One other thought: Is it hard to add a second count to the r results? One for the docket count and one for the recap document count, like we have on the front end? I'm eventually hoping to use this API for the front end (with HTMX), so I'm thinking about gaps we'd have to fill if we did that.
One other thought: Is it hard to add a second count to the r results? One for the docket count and one for the recap document count, like we have on the front end? I'm eventually hoping to use this API for the front end (with HTMX), so I'm thinking about gaps we'd have to fill if we did that.
It's not difficult to add a count for the recap documents. I didn't include it because it requires an additional ES query for every API request, and the current V3 only displays one count.
However, if it's beneficial for users to have the recap document count in the r type, we could add a secondary count named document_count or similar.
We might also use this for the frontend in the future. Alternatively, if the document count is only useful for the frontend, we could add an extra parameter to the request, something like document_count=on. This way, the response returns the document count for use in the frontend while it remains disabled in the API to save a query.
I think it's OK to add it to all responses, particularly if we do the cardinality thing that should make it pretty performant?
I think it's OK to add it to all responses, particularly if we do the cardinality thing that should make it pretty performant?
Yeah! I'll add the secondary count for the r type then.
The document_count provided by a cardinality query is now added to the r type.
The d and rd types lack this key in the response, so this is an additional detail that we'll need to explain in the documentation.
Thank you @ERosendo. I've applied your suggestions; let me know if they look good to you.
@albertisfu LGTM 👍
@blancoramiro, this has a noop migration, so it needs a little hand-holding. Do you think you can get it deployed, please?
Migrations applied and merge deployed!