Develop v4 Search API endpoints for documents that use join fields in Elasticsearch.

Open albertisfu opened this issue 1 year ago • 15 comments

Now that we're using a Join field for indexing position child documents in Judge search we may want to create a new API endpoint or API version to bring support for search results with nested documents.

e.g:

{
   "name":"Curtis Clinton",
   "name_reverse":"Clinton, Curtis",
   "....",
   "positions":[
      {
         "court_exact":"ca1",
         "position_type":"jud",
         "..."
      },
      {
         "court_exact":"ca2",
         "position_type":"jud",
         "..."
      }
   ]
}

A new endpoint or version makes sense here, since we don't want to break the current Search API by changing it to this new results structure.

This might also be required when working on other documents that use a parent-child approach, like RECAP.
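A minimal sketch of how flat parent and child hits could be grouped into the nested shape shown above. The function name, the `id` and `person_id` keys, and the sample data are assumptions for illustration, not the project's actual serializer.

```python
from collections import defaultdict
from typing import Any


def nest_children(
    parents: list[dict[str, Any]],
    children: list[dict[str, Any]],
    parent_key: str = "id",  # hypothetical parent identifier
    child_parent_key: str = "person_id",  # hypothetical pointer from child to parent
    nested_field: str = "positions",
) -> list[dict[str, Any]]:
    """Attach each child document to its parent under `nested_field`."""
    by_parent: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for child in children:
        # Drop the parent pointer: it is redundant once the child is nested.
        by_parent[child[child_parent_key]].append(
            {k: v for k, v in child.items() if k != child_parent_key}
        )
    return [
        {**parent, nested_field: by_parent.get(parent[parent_key], [])}
        for parent in parents
    ]


people = [{"id": 1, "name": "Curtis Clinton"}]
positions = [
    {"person_id": 1, "court_exact": "ca1", "position_type": "jud"},
    {"person_id": 1, "court_exact": "ca2", "position_type": "jud"},
]
nested = nest_children(people, positions)
```

A parent with no children gets an empty list rather than a missing key, so clients can always iterate over the nested field.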

albertisfu avatar Aug 19 '23 01:08 albertisfu

I agree, we should do this. It'll be v4 of the API. v3 can continue working for as long as needed. We'll log who's using it and which APIs are being used, and then we'll slowly move folks to v4. But this can launch after everything else, right?

mlissner avatar Aug 24 '23 21:08 mlissner

Yeah, this can be launched after we move all the documents to Elasticsearch. For now, we'll make sure the V3 Search API has the same output for all the documents.

albertisfu avatar Aug 25 '23 00:08 albertisfu

From https://github.com/freelawproject/courtlistener/pull/3648/, it seems like getting counts is pretty expensive. I'd suggest that for the new Elastic API we don't supply counts to users or that if we do, we do something a bit like Gmail:

  • if it's less than 10k results, we provide the number
  • if it's more, we provide something like, "page 1 of many"

Here's that idea in Gmail and likely for the same reason:

image

Google.com used to do this back in the day too, I think!
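The capped-count idea could be sketched roughly as follows. The 10k threshold comes from the suggestion above; the function names, the `count_is_exact` key, and the page size are hypothetical.

```python
MAX_EXACT_COUNT = 10_000  # threshold taken from the suggestion above


def describe_count(hits: int, cap: int = MAX_EXACT_COUNT) -> dict:
    """Return an exact count below the cap, otherwise only a lower bound."""
    if hits < cap:
        return {"count": hits, "count_is_exact": True}
    return {"count": cap, "count_is_exact": False}


def page_label(page: int, hits: int, page_size: int = 20, cap: int = MAX_EXACT_COUNT) -> str:
    """Build a Gmail-style label: exact page total when cheap, 'many' otherwise."""
    if hits < cap:
        total_pages = max(1, -(-hits // page_size))  # ceiling division
        return f"page {page} of {total_pages}"
    return f"page {page} of many"
```

With this shape, the expensive exact count is only ever needed for result sets below the cap.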

mlissner avatar Jan 23 '24 22:01 mlissner

Per our discussion today, here's how the migration to v4 of the API will go:

  1. We need a list of all the people that are using v3 of the search APIs or v1 of the search webhooks, so we can email them the deprecation plan and schedule.

    For the plan, we need details for each object type of:

    • The changes that will be needed for v3 to be powered by Elastic (see https://github.com/freelawproject/courtlistener/issues/3632#issuecomment-2010815640 for opinions, and we need this for the other endpoints as well).
    • The migration path to v4 of the API (see: https://github.com/freelawproject/courtlistener/issues/3874 about a migration guide)
  2. We need to create v4 of the API for all objects: opinions, people, RECAP and oral arguments. Oral args already uses Elastic, but it still needs to have a v4.

  3. Launch v4 of the API with its documentation: https://github.com/freelawproject/courtlistener/issues/3874

  4. Launch v2 of the webhook API with its documentation.

  5. Once those are launched, the deprecation timeline begins. People will have a short period of time to upgrade to v2 of webhooks and v4 of the API.

  6. Switch to serving old webhooks and API with Elastic.

  7. See who is still on the old versions and try to upgrade them.

  8. Finally launch opinions search in the UI, alerts, etc.

mlissner avatar Mar 20 '24 23:03 mlissner

I reviewed whether it's possible to retrieve the users who are using the Search API. However, I'm afraid it is not possible to obtain a list of users for this specific endpoint directly.

We have counts for each endpoint, but that set does not include user information. We have counts per user, both globally and per day, but these sets do not include endpoint data.

Therefore, from the Redis stats we could only retrieve the list of users for all endpoints using the following script:

import csv
from pathlib import Path

from django.conf import settings
from django.contrib.auth.models import User
from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("STATS")

user_counts_key = 'api:v3.user.counts'
api_users = r.zrange(user_counts_key, 0, -1, withscores=True)

user_requests_dict = {user_id: requests_count for user_id, requests_count in api_users}
user_ids = [int(user[0]) for user in api_users if user[0] not in ["None", "AnonymousUser"]]

media_path = Path(settings.MEDIA_ROOT)
# The directory for storing the CSV
api_directory = media_path / 'api_data'
api_directory.mkdir(parents=True, exist_ok=True)
csv_file_path = api_directory / 'api_users.csv'
# Write the data to a CSV file
api_users_queryset = User.objects.filter(id__in=user_ids).values_list("pk", "email")
with open(csv_file_path, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['UserID', 'User Email', 'Number of Requests'])

    # Save the user details
    for user_id, email in api_users_queryset.iterator():
        user_requests = user_requests_dict[str(user_id)]
        writer.writerow([user_id, email, user_requests])

So, we can email all the API users.

To obtain a list of users specifically for the Search API, we would need to analyze the Cloudfront logs to identify unique IPs that have requested /api/rest/v3/search/ in the last few months. Then, we could attempt to map those IPs to users using the data in Redis.

Let me know what you think.

albertisfu avatar Mar 21 '24 16:03 albertisfu

OK, that's fine, because we might as well release v4 of the whole API at the same time anyway. Thanks.

Let's get v4 launched in parallel with v3 and then we'll know all the changes people will need to make, which we can explain in our deprecation email to the people above.

mlissner avatar Mar 21 '24 21:03 mlissner

(Oh, and using Cloudfront logs probably won't work because we delete logs according to our privacy policy — I think.)

mlissner avatar Mar 21 '24 21:03 mlissner

Code ran fine, but I needed one tweak which I've edited into it above.

We've got about 7000 API users that have made about 220M requests.

mlissner avatar Mar 22 '24 00:03 mlissner

One question here: should we follow a specific order when developing the different new endpoints for the V4 Search API? According to the plan outlined above, all the new endpoints will be launched simultaneously, correct?

Regarding the new Search API content:

The main difference will be the support of nested documents. For instance, for Opinions, it might look like this:

{
   "absolute_url":"/opinion/403/strickland-v-washington/",
   "attorney":"Lorem Attorney",
   "caseName":"Strickland v. Washington.",
   "citation":null,
   "citeCount":1,
   "cluster_id":403,
   "court":"court of the Medical Worries",
   "court_citation_string":"",
   "court_exact":"canb",
   "court_id":"canb",
   "dateArgued":"2019-11-18",
   "dateFiled":"2020-08-15",
   "dateReargued":null,
   "dateReargumentDenied":null,
   "date_created":"2024-03-27T13:57:15.696207-00:00",
   "docketNumber":"1:21-cv-1234",
   "docket_id":564,
   "judge":"",
   "lexisCite":"",
   "neutralCite":"",
   "panel_ids":null,
   "scdb_id":"",
   "sibling_ids":[
      558
   ],
   "status":"Published",
   "suitNature":"",
   "timestamp":"2024-03-27T13:57:16.185406-00:00",
   "opinions":[
      {
         "id":558,
         "author_id":561,
         "type":"combined-opinion",
         "per_curiam":false,
         "download_url":null,
         "local_path":null,
         "text":"Code, § 1-815",
         "sha1":"",
         "cites":[
            107978,
            531313
         ],
         "joined_by_ids":[
            
         ]
      }
   ]
}

Additionally, here are some decisions we could consider:

  • Date objects can now be rendered as a date without time, e.g., dateFiled.
  • DateTime objects can be rendered in UTC instead of PT.
  • Empty lists could be rendered as an empty array.
  • Empty strings will be rendered as "", while other types of objects like dates or integers can be rendered as null.
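The rendering conventions listed above could be sketched like this. The function name and the `is_list_field` / `is_text_field` flags are hypothetical, and treating naive datetimes as UTC is an assumption of this sketch.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Any


def render_value(value: Any, *, is_list_field: bool = False, is_text_field: bool = False) -> Any:
    """Render one field value following the conventions listed above."""
    if isinstance(value, datetime):
        # Check datetime before date: datetime is a subclass of date.
        if value.tzinfo is None:
            # Assumption: naive datetimes are already in UTC.
            value = value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc).isoformat()
    if isinstance(value, date):
        return value.isoformat()  # date only, no time component
    if value is None and is_list_field:
        return []  # empty lists render as an empty array, not null
    if value is None and is_text_field:
        return ""  # empty strings stay as ""
    return value  # other empty values (dates, integers) remain null


pacific = timezone(timedelta(hours=-7))  # fixed offset, for illustration only
rendered_created = render_value(datetime(2024, 3, 27, 6, 57, 15, tzinfo=pacific))
rendered_filed = render_value(date(2020, 8, 15))
```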

What do you think about it? Do you have any additional decisions to consider regarding the content?

albertisfu avatar Mar 30 '24 01:03 albertisfu

One question here: should we follow a specific order when developing the different new endpoints for the V4 Search API? According to the plan outlined above, all the new endpoints will be launched simultaneously, correct?

I think the plan is to launch v4 in parallel with v3, so for that reason it's fine to roll them out one at a time (nobody has to know v4 is launched for one endpoint or another). It can just be there without us telling anybody.

The order I think we should do them in is from most complexity to least, unfortunately. That way, we can figure out the hard parts, document them, write a migration document, and start telling people about the changes as soon as possible.

Regarding the new Search API content:

The main difference will be the support of nested documents.

Yes.

Additionally, here are some decisions we could consider:

  • Date objects can now be rendered as a date without time, e.g., dateFiled.

Sure, that feels like an improvement we should do.

  • DateTime objects can be rendered in UTC instead of PT.

Yes, that will be better.

  • Empty lists could be rendered as an empty array.

What are they now? I would expect an empty list to be an empty list/array. Are they null now?

  • Empty strings will be rendered as "", while other types of objects like dates or integers can be rendered as null.

This is the Django way. I'd stick with "".

I think that's it for me. I can't think of much else, aside from the fixes in other issues.

mlissner avatar Mar 30 '24 05:03 mlissner

Great! Thank you for your answers. The most complex seems to be RECAP, as it involves various sorting keys, including one that depends on documents. Thus, that'd be the first one to develop.

What are they now? I would expect an empty list to be an empty list/array. Are they null now?

Yes, currently, they are null.

I think that's it for me. I can't think of much else, aside from the fixes in other issues.

Perfect! I'll start working on this. Let me know if we receive additional feedback from users that should be incorporated during the process.

albertisfu avatar Apr 01 '24 15:04 albertisfu

@mlissner while working on the serializer for the v4 RECAP Search API, I have a couple of questions so far:

Here's what the serializer looks like so far. It primarily displays the Docket fields, with the RECAP Documents that matched the query nested within the recap_documents array. Any Docket fields duplicated in the RECAP Documents are filtered out of the nested documents, since they're already displayed at the Docket level.

[
   {
      "docket_slug":"pellegrini-v-omalley",
      "docket_absolute_url":"/docket/102/pellegrini-v-omalley/",
      "court_exact":"njd",
      "party_id":[
         
      ],
      "party":[
         
      ],
      "attorney_id":[
         
      ],
      "attorney":[
         
      ],
      "firm_id":[
         
      ],
      "firm":[
         
      ],
      "docket_child":"docket",
      "timestamp":"2024-04-11T07:22:16.270652",
      "docket_id":102,
      "caseName":"Pellegrini v. O'Malley",
      "case_name_full":"",
      "docketNumber":"3:23-cv-03853",
      "suitNature":"Social Security: SSID Tit. XIV",
      "cause":"42:427 Social Security Benefits",
      "juryDemand":"",
      "jurisdictionType":"",
      "dateArgued":null,
      "dateFiled":null,
      "dateTerminated":null,
      "assignedTo":null,
      "assigned_to_id":null,
      "referredTo":null,
      "referred_to_id":null,
      "court":"District Court, D. New Jersey",
      "court_id":"njd",
      "court_citation_string":"D.N.J.",
      "chapter":null,
      "trustee_str":null,
      "date_created":"2024-04-03T19:24:26.977578+00:00",
      "pacer_case_id":"416250",
      "recap_documents":[
         {
            "id":14999,
            "docket_entry_id":14998,
            "description":" ORDER  by Judge Thomas S. Hixson granting  11   Plaintiff's Motion for Summary Judgment; denying  15   Defendant's Motion for Summary Judgment. This matter is REMANDED for further administrative proceedings consistent with this order.  (tshlc1, COURT STAFF) (Filed on 4/2/2024)",
            "entry_number":1,
            "entry_date_filed":"2024-04-02",
            "short_description":"",
            "document_type":"PACER Document",
            "document_number":"19",
            "pacer_doc_id":"035024245133",
            "attachment_number":null,
            "is_available":true,
            "page_count":29,
            "plain_text":"11 <mark>v</mark> SUMMARY JUDGMENT",
            "filepath_local":"recap/dev.gov.uscourts.cand.416250/gov.uscourts.cand.416250.19.0.pdf",
            "absolute_url":"/docket/102/19/pellegrini-v-omalley/",
            "cites":[
               
            ],
            "timestamp":"2024-04-11T07:22:16.333700"
         },
         {},
         {},
         {},
         {}
      ]
   }
]

In the frontend, Dockets are displayed as primary documents, and up to 5 RECAP Documents that matched the query are nested within. If no RECAP Documents are matched, it shows 5 child documents matched by a match_all query.

  • My question is whether we should preserve the same behavior in the API. Should we display up to 5 RECAP Documents that also matched the query? And if no RECAP Documents were matched, should we show 5 child documents within the docket matched by a child match_all query, or should the recap_documents array be empty if no RECAP Documents were matched?

  • Also, in the frontend, if more than 5 RECAP Documents match in a docket, a button labeled Show additional results for this case is displayed. Should we display a field that indicates more child documents were matched by the query and include a link to the API query URL, but constrained to the docket_id? If we do this query constrained to the docket_id and want to preserve the same behavior as in the frontend, we should display up to 100 nested RECAP Documents for that docket. If more than 100 RECAP Documents match this constrained query, the frontend displays a button that points to the Docket page. So, in the API, should we indicate somehow that there are more than 100 results for this constrained query?

  • Regarding highlighting: In V3 of the Search API, the only field that gets highlighted is the snippet field, which previously contained almost all the fields of the document. In the Elastic version of the V3 Opinions Search API, we continue using the snippet field and get it highlighted; however, it now only gets its content from the Opinion text field. So, my question is whether we should continue adding a snippet field, or whether it's now unnecessary and we should instead show the plain_text field directly and get it highlighted.

Will the plain_text from the RECAPDocuments be the only field that we should highlight in the RECAP Search API results?

albertisfu avatar Apr 11 '24 19:04 albertisfu

My question is whether we should preserve the same behavior in the API. Should we display up to 5 RECAP Documents that also matched the query?

Yes.

And if no RECAP Documents were matched, should we show 5 child documents within the docket matched by a child match_all query, or should the recap_documents array be empty if no RECAP Documents were matched?

The latter. It should be empty like in the front end.

Note also that it's currently possible to set the type=d, and get back dockets as results, with no nested content. We should make sure we still support that, because some people just want to search dockets and get the performance boost of doing that.

I think we should also add a feature to search just documents via a new type parameter, type=d. That'd be close to the current functionality, I think, but would just search documents. This can come in a future iteration if it's hard. What do you think?

Should we show a button if there are more than five nested documents?

Do we have the count of nested items and can we just show that? I'd suggest we just do that, and if people want to search for those documents, they can use the type=d parameter I was just mentioning?

Regarding highlighting...

Good questions. I don't think we should show the plaintext field, because it's too big to do for 10×5 results. That'd be a huge response, and the snippet is a good thing to show instead.

I think we should use highlighting like we do on the front end (i.e., all the fields), but it'd be nice if it had to be opted into via a highlight=true parameter. That way, most people get fast API results, but if you need highlighting, you enable it and take the performance hit.

What do you think?

mlissner avatar Apr 11 '24 21:04 mlissner

Great, thanks! Some follow-up questions:

The latter. It should be empty like in the front end.

Well, in the frontend, the only scenario where dockets do not contain nested documents is when the docket has no filings.

In all other scenarios where dockets are not empty, nested documents are always shown in the frontend, either by:

  • RECAPDocuments that matched because child documents also contain docket fields, so they matched by those fields even though RECAPDocument-specific fields are not matched.
  • If the query only contains a field specific to the Docket, like party or attorney, and if the result matched a docket with filings, even though the party query didn't match any RECAPDocument, up to 5 child documents are shown, matched by a match_all child query, selecting 5 documents from the docket. So I guess we should mimic the same behavior in the API, correct?
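A rough sketch of the child clause behind the behavior described above, written as an Elasticsearch `has_child` query with `inner_hits`. Only the child clause is shown; how it is combined with the docket-level query is left out. The function name and the `recap_document` child type are assumptions, not the project's actual query builder.

```python
from typing import Any, Optional


def build_child_clause(
    child_query: Optional[dict[str, Any]],
    child_type: str = "recap_document",  # hypothetical join-field child name
    display_limit: int = 5,
) -> dict[str, Any]:
    """Build a has_child clause, falling back to match_all for the children."""
    return {
        "has_child": {
            "type": child_type,
            # With no child-specific query (e.g. a party-only search),
            # fall back to match_all so the docket still shows some filings.
            "query": child_query if child_query else {"match_all": {}},
            # Ask for one extra child so "more than display_limit" is detectable.
            "inner_hits": {"size": display_limit + 1},
        }
    }


fallback_clause = build_child_clause(None)
text_clause = build_child_clause({"match": {"plain_text": "summary judgment"}})
```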

Note also that it's currently possible to set the type=d, and get back dockets as results, with no nested content. We should make sure we still support that, because some people just want to search dockets and get the performance boost of doing that.

Sure, I'll make sure that this functionality is preserved in the API so that type=d only returns Dockets without nested RECAPDocuments.

I think we should also add a feature to search just documents via a new type parameter, type=d. That'd be close to the current functionality, I think, but would just search documents. This can come in a future iteration if it's hard. What do you think?

Yes, it's possible to add a new search type that returns only RECAPDocuments. I think it'd be pretty straightforward. We already have a query that only matches RECAPDocuments used in the Feed. What would be the new type for this functionality, to differentiate it from the specific docket search type=d?

Do we have the count of nested items and can we just show that? I'd suggest we just do that, and if people want to search for those documents, they can use the type=d parameter I was just mentioning?

No :(, that's the issue we have in the frontend; it'd be too costly to get the exact count of child documents on every result on the page. So we can only know if there are more than 5 results; we request 6 child documents and if there are more than 5, we show the button for the constrained docket_id query in the frontend. We could maybe just indicate also that there are more than 5 nested results in the API maybe using a boolean?
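The "request six, show five" approach could be sketched as below. The function name and the `more_docs` field name are placeholders for illustration.

```python
def nest_with_more_flag(child_hits: list[dict], display_limit: int = 5) -> dict:
    """Expects up to display_limit + 1 hits; the extra hit only signals 'more'."""
    return {
        "recap_documents": child_hits[:display_limit],
        # True only when the extra (sixth) document came back.
        "more_docs": len(child_hits) > display_limit,
    }


six_hits = [{"id": i} for i in range(6)]
three_hits = [{"id": i} for i in range(3)]
full = nest_with_more_flag(six_hits)
partial = nest_with_more_flag(three_hits)
```

This avoids an exact child count per result: the flag costs one extra child document per docket.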

Good questions. I don't think we should show the plaintext field, because it's too big to do for 10×5 results. That'd be a huge response, and the snippet is a good thing to show instead.

Yeah, you're right. So we'll show the snippet field built from the plain_text content: the highlighted fragment when HL matches in the plain_text field, or up to the first 500 chars when no HL is matched.

I think we should use highlighting like we do on the front end (i.e., all the fields), but it'd be nice if it had to be opted into via a highlight=true parameter. That way, most people get fast API results, but if you need highlighting, you enable it and take the performance hit.

Great, that means that every field that supports HL in the frontend can also be highlighted in the API if the parameter highlight=true in the URL is set, right?

By default, if highlight is not passed, highlighting will be disabled, correct?

To disable HL completely and get the performance boost, we'd need to get rid of the no_match_size functionality that extracts the first 500 chars from the plain_text field (if no HL is matched) and instead get the first 500 chars from the DB when HL is disabled. Otherwise, HL won't be completely disabled for the plain_text field and it can still get highlighted.
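A minimal sketch of opt-in highlighting with a plain snippet fallback. The helper names are hypothetical, and `plain_text` stands in for text loaded from the DB; only the highlight=true parameter and the 500-char limit come from the discussion.

```python
from typing import Mapping, Optional

NO_MATCH_SIZE = 500  # snippet length used when nothing is highlighted


def highlighting_enabled(params: Mapping[str, str]) -> bool:
    """Highlighting is off unless the request passes highlight=true."""
    return params.get("highlight", "").strip().lower() == "true"


def build_snippet(plain_text: str, highlighted: Optional[str], params: Mapping[str, str]) -> str:
    """Use the highlighted fragment only when opted in; else a plain prefix."""
    if highlighting_enabled(params) and highlighted:
        return highlighted
    # No highlighter involved here: just the first 500 chars of the text.
    return plain_text[:NO_MATCH_SIZE]


text = "SUMMARY JUDGMENT " * 100
with_hl = build_snippet(text, "<mark>SUMMARY</mark> JUDGMENT", {"highlight": "true"})
without_hl = build_snippet(text, "<mark>SUMMARY</mark> JUDGMENT", {})
```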

albertisfu avatar Apr 11 '24 22:04 albertisfu

Re child docs:

I guess we should mimic the same behavior in the API, correct?

Yeah, it seems simplest. I could make arguments for different functionality, but I think it's best to keep it close to the front end so that people can swap a query from the front end to the API and basically get the same results.

I'm also hoping to use this API for the front end eventually. :)

What would be the new type for this functionality, to differentiate it from the specific docket search type=d?

Oops! I was thinking d for document and forgot (from 10 seconds ago) when d was dockets! Jeesh. Well, I think that means we do type=rd?
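The type handling discussed here could be sketched as a small lookup. The enum and function names are hypothetical, and defaulting any other value to nested results is an assumption of this sketch; only type=d and type=rd come from the thread.

```python
from enum import Enum
from typing import Optional


class RecapSearchMode(Enum):
    NESTED = "nested"  # dockets with nested RECAP documents
    DOCKETS_ONLY = "dockets_only"  # type=d: dockets, no nested content
    DOCUMENTS_ONLY = "documents_only"  # type=rd: RECAP documents only


TYPE_TO_MODE = {
    "d": RecapSearchMode.DOCKETS_ONLY,
    "rd": RecapSearchMode.DOCUMENTS_ONLY,
}


def resolve_mode(type_param: Optional[str]) -> RecapSearchMode:
    """Map the `type` query parameter to a search mode (assumed default: nested)."""
    return TYPE_TO_MODE.get(type_param or "", RecapSearchMode.NESTED)
```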

We could maybe just indicate also that there are more than 5 nested results in the API maybe using a boolean?

Yeah, that's right. Sorry my memory is so bad. Yes, this seems like the best solution. Perhaps more_docs=true?

Great, that means that every field that supports HL in the frontend can also be highlighted in the API if the parameter highlight=true in the URL is set, right?

Yes.

By default, if highlight is not passed, highlighting will be disabled, correct?

Yes.

To disable HL completely and get the performance boost, we'd need to get rid of the no_match_size functionality that extracts the first 500 chars from the plain_text field (if no HL is matched) and instead get the first 500 chars from the DB when HL is disabled.

Makes sense. Thank you!

mlissner avatar Apr 11 '24 23:04 mlissner

I think we can also close this one as all the V4 Search API endpoints have already been deployed (except for the parentheticals, but that one doesn't use join fields and still needs fixes in the frontend first).

albertisfu avatar Jun 06 '24 00:06 albertisfu

Hell yeah! Congratulations!

mlissner avatar Jun 06 '24 00:06 mlissner