juriscraper
Scrape United States Supreme Court (SCOTUS)
A new electronic docketing system was introduced at the United States Supreme Court this week. Unfortunately it doesn't seem to have changed much of their existing public-facing system, but we should scrape the system for filings and index them.
Some structural notes:
- There is no way to get a list of recent filings across all cases, such as a global RSS feed.
- Each case has its own RSS feed, e.g. https://www.supremecourt.gov/rss/cases/17-74.xml. For some reason the timestamp seems to bump daily, even though there is no visible docket activity. Also, the timestamp is marked as EDT but really is in EST.
- Each case has a JSON and an XML report: http://www.supremecourt.gov/RSS/Cases/JSON/17-74.json http://www.supremecourt.gov/RSS/Cases/XML/17-74.xml
- Although the above don't have proper timestamps for linked PDFs, the name of the link contains a timestamp in "YYYYMMDDhhmmss{milliseconds}" format, e.g.: http://www.supremecourt.gov/DocketPDF/17/17-74/19830/20171113205323142_17-71%2017-74%20Weyerhaeuser%20Co.%20Markle%20Interests%20LLC.pdf
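For reference, a minimal sketch of recovering that timestamp; the helper and the 17-digit-prefix assumption are mine, not existing Juriscraper code:

```python
import re
from datetime import datetime

def filing_timestamp(pdf_url):
    """Parse the 17-digit YYYYMMDDhhmmss{ms} prefix of a DocketPDF filename."""
    m = re.search(r"/(\d{17})_", pdf_url)
    if m is None:
        return None
    # strptime's %f accepts 1-6 digits and zero-pads on the right, so the
    # trailing 3 millisecond digits parse correctly (142 -> 142000 microseconds).
    return datetime.strptime(m.group(1), "%Y%m%d%H%M%S%f")

url = ("http://www.supremecourt.gov/DocketPDF/17/17-74/19830/"
       "20171113205323142_17-71%2017-74%20Weyerhaeuser%20Co."
       "%20Markle%20Interests%20LLC.pdf")
print(filing_timestamp(url))  # 2017-11-13 20:53:23.142000
```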
@johnhawkinson I assume this is stale and no longer and issue.
> I assume this is stale and no longer an[] issue.
Err…why would you assume that?
I think everything in the original PR remains true, and with appellate RECAP on the horizon, setting sights on the Supreme Court seems more achievable.
Do we not already collect everything? @johnhawkinson
This is about filings too, @flooie.
I have a working implementation that downloads the SCOTUS docket JSON feeds and stores them in SQLite. It uses a subset of Juriscraper's dependencies. Happy to work on adapting this if you'd like a PR, but I will need design guidance on where to fit this into the package structure and workflows.
Ooooh, that's fun @ralexx. I think we'd want it to have a directory next to the `pacer` directory.
With that in mind, do you want to suggest a class hierarchy, and perhaps @flooie or @grossir can help guide your work so it fits with the rest of our style/approach/architecture?
Sounds good, @mlissner.
Is this issue the best place to ask further questions?
Yep, it's a great place to discuss things.
> I think we'd want it to have a directory next to the pacer directory.
I will work in `juriscraper/juriscraper/scotus_docket` and `juriscraper/tests/examples/scotus_docket`. You've used underscore naming for the `oral_args/` directory, so I'm following that, but I also noticed that `pacerdocket.py` doesn't use an underscore, so I'm happy to fit whichever style.
My first design question is about how you (@flooie and @grossir) would prefer to handle obtaining the Supreme Court docket numbers with which to scrape dockets. The only two sources of truth I've found are PDF documents:
- The Journal of the Supreme Court, which appears to be a comprehensive listing of cases regardless of disposition.
- The Supreme Court "Granted & Noted List", which looks like a subset of cases where certiorari has been granted.
The PDFs are pretty clean and I've had reasonable success extracting docket numbers and even some unique metadata from the Granted & Noted List using `pypdf` and regex. However, I'm not sure whether you want to add `pypdf` as a package dependency. And the more prose-like format of the Journal may be trickier to parse; I haven't attempted it yet.
Instead I resorted to brute force and am just trying docket numbers sequentially, rate-limiting my requests to 1/sec (a minimal sketch follows). It's hacky but it works. The challenge from a resource-usage point of view is that there are discontinuities in the docket numbering: e.g. YY-2000 to YY-4999 seem largely unused, but from YY-5000 to YY-12000 about 10% of docket numbers are used. I tried using search algorithms to reduce the search space of docket numbers, but it wasn't worth the effort.
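For concreteness, a sketch of that brute-force pass. It assumes a missing docket returns a non-2xx status from the JSON endpoint; if the server instead returns a 200 "Not Found" page, the check would need content sniffing. Everything beyond the URL pattern and the ~1 req/sec throttle is illustrative:

```python
import time

import requests

JSON_URL = "https://www.supremecourt.gov/RSS/Cases/JSON/{term}-{num}.json"

def probe_term(term, start, stop, delay=1.0):
    """Yield docket numbers in [start, stop) whose JSON endpoint responds."""
    session = requests.Session()
    for num in range(start, stop):
        resp = session.get(JSON_URL.format(term=term, num=num))
        if resp.ok:
            yield f"{term}-{num}"
        time.sleep(delay)  # stay near 1 request/second

# e.g., scan part of the heavily used IFP range for the 2023 term:
# for docket in probe_term(23, 5000, 12000):
#     print(docket)
```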
- Are you aware of any other comprehensive list of Supreme Court dockets?
- Absent that, do you prefer the PDF parsing approach or the brute force approach to identifying active dockets?
- If you prefer brute force, an easy way to shrink the docket number search space over time is excluding previously identified docket numbers. That would require persistent data storage that doesn't appear to be part of Juriscraper. Is this an approach you would like to investigate?
I have been looking into this and I need some clarification on the scope of this source and its integration into CL, @mlissner @flooie.
What DB tables are we going to fill with this data?
It seems to me that this source contains mainly `Docket` and `DocketEntry` objects, plus some related objects such as `OriginatingCourtInformation` and `Party`, and some PDF documents that belong to the `DocketEntry` object. (An example document was highlighted in blue in a screenshot attached here.)
However, I think we do not create `DocketEntry` objects from any source except RECAP. Should we change that here?
How are users interacting with this data?
I am guessing this is for setting up Docket Alerts, as in RECAP? If that's the case, the scraper should have 2 starting points:
- "discovery" of previously unknown dockets (which ties into @ralexx analysis)
- re-scraping known dockets (to fulfill alerts)
About the re-scraping, the RSS endpoint data is not static. If you check the example from the original comment, it has docket entries which are more recent than the comment itself.
Anyhow, I think we will need to create a new caller on CourtListener to call the new Juriscraper scraper and ingest this data, in the folder `courtlistener/cl/scrapers/management/commands/`, something like `cl_scrape_dockets.py`.
Hi @ralexx, thanks for looking into this.
About folder structure
> I will work in juriscraper/juriscraper/scotus_docket
I think it should be `juriscraper/dockets/united_states/federal_appellate/scotus.py`, since we may want to support other docket scrapers in the future. Also, this structure mirrors the one in `juriscraper/opinions` and `juriscraper/oral_args`.
> I will work in ... juriscraper/tests/examples/scotus_docket
Similarly, I think it would be better to mirror `examples/opinions/united_states` and `examples/oral_args/united_states` and use `examples/dockets/united_states/`.
About getting the docket list...
I don't think we should brute-force the search; we try to be as gentle as possible with the server, since we use a user agent that identifies us (except for a couple of sources).
I am going to limit myself to checking the case of "discovery" of new dockets, as mentioned in the previous comment.
... from the Docket Search page
There is a Docket Search page. The search string is used in a full-text search, so if you query for `February 2024`, it returns dockets that contain that string. For example:
"SET FOR ARGUMENT on Wednesday, February 21, 2024",
"February 14, 2024 United States Court of Appeals for the Eighth Circuit Feb 01 2024 Motion to direct the Clerk"
The search supports an exact-match operator, `"February 21, 2024"`, which returns fewer results. However, the search is limited to the last 5 years and to 500 results, and has a page size of 5, which could force us to make several requests.
... from PDFs
About using the PDFs as a source of docket numbers, I think it is a valid idea. About adding the `pypdf` dependency, I don't know if @flooie is against it design-wise, since we use `doctor` to extract text from PDFs. Maybe we could send requests to `doctor` from Juriscraper?
About the Journals, they seem to be published/updated once each year, at the end of the term, so we wouldn't get fresh data from them (which is important if we are implementing this for alerting). They seem great for backscraping old cases.
The Granted and Noted List documents are updated more frequently, and we could indeed get the list from there. However, the "most recent" document (October Term 2024) has an updated date older than the one on the second document (October Term 2023): January 22, 2024 vs February 8, 2024. So we would have to check both.
I think that if one of the use cases of this data is to alert users (previous comment), we should use the HTML Docket Search to collect the docket numbers, since by using a date-like query we will get recent and active cases, and do not limit ourselves only to the "Granted and Noted" list. What do you think @flooie?
@grossir have you looked at the JSON endpoints @johnhawkinson mentions?
> About using the PDFs as source of docket numbers, I think it is a valid idea. About adding the pypdf dependency I don't know if @flooie is against it design wise, since we use doctor to extract text from PDFs. Maybe we could send requests to doctor from juriscraper?
I will work with whatever PDF extractor you prefer; I'm not familiar with `doctor` so I simply reached for something I have used.
> About the Journals, they seem to be published / updated once each year
Actually, the documents are refreshed with a lag of 2-4 weeks. If you look at e.g. p. 306 of the 2023 Journal (the last page as I write this), you can see it is dated Jan. 10, 2024. But still not frequently enough to be useful, as you noted.
I assume we should build this to begin adding dockets directly into CL @mlissner?
@ralexx - doctor is our open source microservice tool we use to process documents for CourtListener.
Using full text search
It sounds as if a belt-and-suspenders search for docket numbers is the way to go. @grossir you make a good point about the full-text search feature: I didn't play with it much because it doesn't expose the underlying database API, so it's one more layer of abstraction.
The full-text search definitely returns spurious matches for our purpose. We can reject them, but we may still have to request the dockets and parse them to validate the docket entries. For example, one result from searching on "Feb|February 16, 2024" is
> "Motion to extend the time to file a response is granted and the time is further extended to and including February 16, 2024, for all respondents."

https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/23-402.html was updated on 2024-02-02, but the search is matching the entry's text.
The SCOTUS server supports the If-Modified-Since HTTP header, so we may be able to avoid the parsing step.
I’m on vacation this week, when I get back I will test full text search against a sample of dockets in the Journal. If that search returns a superset of the dockets with new entries on a given date then I will work on filtering out spurious matches before I report back.
@ralexx @grossir
The SCOTUS orders content looks like it could be a comprehensive source for all docket numbers. After reviewing the order list and the miscellaneous orders that are regularly published, I believe we have a solid method to identify all docket numbers along with their corresponding JSON endpoints.
I put together a simple script and managed to extract 3,199 docket JSON endpoints for the 2023 Term from the SCOTUS Orders page. Unless I am missing something, I don't think we need to mess around with docket search endpoints and can just parse the docket numbers from the orders when they are published.
@flooie, take for example 23A745, Trump v. US (the immunity appeal from the DC Circuit). There hasn't been any order in the case, so I don't think it appears in the Orders: there was an amicus brief filed on 2024-02-20 that I did not find in the Orders PDF for the same date.
Particularly in higher profile cases there are amicus briefs on the docket that may be of general interest; these tend to appear prior to an order (one exception appears to be an order denying amicus curiae leave to file). Relying on orders for docket numbers could leave a substantial lag before such dockets were scraped. I can test this next week.
Got it.
Sorry to be a bit slow chiming in. I've been utterly swamped with meetings this week. A few things come to my mind as the boss man 'round these parts:
Goals
It sounds like we're on a path to scrape all SCOTUS content. That would be phenomenal, and it's something we should have been doing since the day they stood up their open docket website.
If we are to do this, speed and completeness will be essential, because SCOTUS content constantly goes viral among journalists. It's embarrassing for something with a gazillion re-posts not to be in our system at all.
Architecture
- **Do we want to use our current tables in CL?**
  I'm always on the fence about this, and it's a huge question: if we're going to add filings to CL from dozens or hundreds of courts, do we do it in one schema or in dozens? When we did Los Angeles Superior Court, it got its own schema and its own Django app, and that seemed good. I think it's generally the right way to go.
- **How aggressively do we scrape?**
  Normally we try to be gentle on court servers, but this is SCOTUS, and from what I've gathered they're the one court in the land that actually has some sort of scalable architecture. We should still be smart and kind, but I think we can crawl aggressively if needed to hit completeness and speed goals. I would advocate for the least aggressive approach that is guaranteed to work within a few minutes of content being posted.
- **Do we store state in Juriscraper?**
  No. Juriscraper is supposed to be a library that other systems can build on top of. It is usually a wrapper for downloading and parsing particular pages on court websites. For example, you tell it:
  - search for this query
  - download this docket
  - grab that PDF

  And it provides you with JSON, response codes, and PDF binaries. If you need to do something like crawl the docket number space (except for certain numbers you already know about), the storage for that should be up a level, in CourtListener or whatever other system is calling Juriscraper. (A sketch of what such a stateless interface might look like follows this list.)
- **What about folder structure in Juriscraper?**
  I'm sorry to ask for another tweak here, but I would suggest one tweak to what Gianfranco said above. He suggested `juriscraper/dockets/united_states/federal_appellate/scotus.py`. I'd suggest going one step further, with something like `juriscraper/dockets/united_states/federal_appellate/scotus/dockets.py`. That will give you a module to add files for lots of different SCOTUS-related tools.
- **What about crawling/parsing PDFs?**
  This is usually my last resort. I'd rather download 100 URLs around the clock in a brute-force manner than rely on some slow, fragile, and unreliable PDF to do the job. But I haven't looked at the possibilities here. If it's the only way, so be it. I leave choice of library up to y'all, but as Bill knows, I usually prefer PyMuPDF. I think the API is a bit worse for scraping in particular, but the speed is (much) better and it's well maintained.
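To make the stateless-library idea concrete, here is a rough sketch of what a SCOTUS docket report could look like, loosely patterned on the existing PACER report classes. Every name here is hypothetical, not an existing Juriscraper API:

```python
import requests

class ScotusDocketReport:
    """Stateless fetch-and-parse for one SCOTUS docket (illustrative only)."""

    url_template = "https://www.supremecourt.gov/RSS/Cases/JSON/{docket}.json"

    def __init__(self, session=None):
        self.session = session or requests.Session()
        self.response = None

    def query(self, docket_number):
        """Download the JSON for a single docket, e.g. '23-175'."""
        self.response = self.session.get(
            self.url_template.format(docket=docket_number)
        )

    @property
    def data(self):
        """Parsed docket data; storing/diffing it is the caller's job."""
        self.response.raise_for_status()
        return self.response.json()

# report = ScotusDocketReport()
# report.query("23-175")
# print(report.data["PetitionerTitle"])
```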
Putting this all together...
What all this means when you put it together is that we'll need the following components:
- Juriscraper to scrape content and make it digestible
- A daemon of some kind to call Juriscraper and scrape content around the clock.
- Some database models, probably new ones, to store what we scrape.
- HTML and views to display the content.
- Integration into our APIs, bulk data, and search engines (somehow!).
I think that's it for my comments. Sorry this is a lot, but adding a new court is a fairly big task. I think it's worth doing though and I'm excited to have some momentum here.
How do you (all) feel about the SCOTUS scraper having access to persistent state, whether an underlying docket database or its own database?
As far as I have seen, this doesn't happen elsewhere in Juriscraper: scraped data is presented to the caller for it to handle. I presume this statelessness is a design feature, but I'd like to clarify that.
Based on what I've found so far, there appear to be efficiency gains in downloading dockets when it's possible to discriminate between docket numbers that are known good and the rest. I will elaborate once I have some data to show you, but if you want SCOTUS scrapers to be stateless as a design principle then I won't work on/include that part of my prototype code.
Yeah, I think it's best to keep Juriscraper stateless, but we'll certainly need that analysis when it comes to the calling code, so it can do that.
I've looked into the different sources and types of information available at <supremecourt.gov> to guide my planning. Please let me know if you see that I'm missing something.
Published docket information
These are the parts of supremecourt.gov most likely to be useful here. There are some additional sources, such as the Orders of the Court by Circuit, that I have omitted because they are merely different presentations of information found elsewhere.
Click for information sources summary table
| Source | Format | Completeness | Timeliness |
|---|---|---|---|
| Journal of the SCOTUS | PDF | Supposedly, every disposition at the Court. Includes Bar admissions and other cruft. | "New Journal entries are posted on this website about two weeks after the event." That is being optimistic: as of this writing on March 1, 2024, the last Journal entry for the 2023 term is dated January 10, 2024. |
| Orders of the Court | PDF | Unsigned orders, i.e. not including Opinions. | "Regularly scheduled lists of orders are issued on each Monday that the Court sits, but 'miscellaneous' orders may be issued in individual cases at any time. Scheduled order lists are posted on this Website on the day of their issuance, while miscellaneous orders are posted on the day of issuance or the next day." |
| Granted/Noted Cases List | PDF | Mostly grants of certiorari, including decided cases (for which there are opinions). Very limited subset of cases. | Less than a week of lag; as of March 1, 2024 the document is dated as of February 28, 2024. |
| Opinions Relating to Orders | | Only opinions that accompany select summary dispositions (typically dissents). These also appear at the back of the Orders documents. | "Any opinions...will be posted here on the day they are issued." |
| Opinions of the Court | | All cases decided by the full court have their opinions published here. | These appear to be posted the day the case is decided. |
| Docket Search | HTML, JSON, XML | As far as I can tell, it's all here. | Reflects the timeliness of the underlying dockets. |
| Calendars and Lists: Argument Calendar | | Contains docket numbers of cases scheduled for oral arguments. | Published 2-3 months prior to the session (within a term) in which the arguments will be heard. |
| Calendars and Lists: Court Calendar | | Waste of time unless you want to fine-tune image recognition ML models; easier to eyeball and transcribe. Can be useful for timing scrapes of Orders Lists, which appear "at 9:30 a.m. on the Monday following a Court conference, usually held three times a month when the Court is sitting". | Has all the key dates for a term. |
| Calendars and Lists: Hearing Lists | | Contains docket numbers of the cases to be heard on an upcoming day over the following week. Somehow different from the Argument Calendar; I'm not clear on the distinction. | Published (late in the) prior week. |
| Calendars and Lists: Day Call | | Contains docket numbers of the cases to be heard that day. | Apparently a daily update to the Argument Calendar, published the morning (ET) of the prior day. |
Docket pages
Email notification
...is available, but not if you're a 'bot. There is a graphical captcha interstitial to inhibit simply signing up for all dockets and receiving push notifications by email. Because heaven forbid the unwashed masses might want what's theirs?
HTML pages
As returned by search queries: https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/23-175.html
Also available directly: https://www.supremecourt.gov/docket/docketfiles/html/public/23-175.html
These are the pages linked to by the full-text docket search feature. They also contain links to JSON and XML representations of what I believe is identical data. Not surprisingly, I have not found systematic (or, so far, any) discrepancies between HTML and JSON representations of dockets.
These pages are updated intra-day.
JSON pages
Found after navigating to RSS feeds of dockets: http://www.supremecourt.gov/RSS/Cases/JSON/23-175.json
XML representations are found by substituting ../XML/
Contains all docket information. Updated intra-day.
Example: docket 23-939
{"CaseNumber":"23-939 ","bCapitalCase":false,"sJsonCreationDate":"03/01/2024","sJsonTerm":"2023","sJsonCaseNumber":"00939","sJsonCaseType":"Paid","RelatedCaseNumber":[],"PetitionerTitle":"Donald J. Trump, Petitioner","RespondentTitle":"United States","DocketedDate":"February 28, 2024","Links":"Linked with 23A745","LowerCourt":"United States Court of Appeals for the District of Columbia Circuit","LowerCourtCaseNumbers":"(23-3228)","LowerCourtDecision":"February 6, 2024","QPLink":"../qp/23-00939qp.pdf","ProceedingsandOrder":[{"Date":"Feb 12 2024","Text":"Application (23A745) for a stay, submitted to The Chief Justice.","Links":[{"Description":"Main Document","File":"2024-02-12 - US v. Trump - Application to S. Ct. for Stay of D.C. Circuit Mandate - Final With Tables and Appendix.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300410/20240212154110541_2024-02-12%20-%20US%20v.%20Trump%20-%20Application%20to%20S.%20Ct.%20for%20Stay%20of%20D.C.%20Circuit%20Mandate%20-%20Final%20With%20Tables%20and%20Appendix.pdf"},{"Description":"Proof of Service","File":"2024-02-12 - Certificate of Service for Stay Application.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300410/20240212154123465_2024-02-12%20-%20Certificate%20of%20Service%20for%20Stay%20Application.pdf"}]},{"Date":"Feb 12 2024","Text":"Petition for a writ of certiorari filed."},{"Date":"Feb 13 2024","Text":"Response to application (23A745) requested by The Chief Justice, due February 20, 2024, by 4pm (EST)."},{"Date":"Feb 13 2024","Text":"Brief amicus curiae of Jon Danforth, J. Michael Luttig, Carter Phillips, Peter Keisler, Larry Thompson, Stuart Gerson, et al. filed.","Links":[{"Description":"Main Document","File":"2024-2-13 Amici Curiae Brief Opposing Application for Stay.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300472/20240213120356911_2024-2-13%20Amici%20Curiae%20Brief%20Opposing%20Application%20for%20Stay.pdf"},{"Description":"Proof of Service","File":"Proof of Service.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300472/20240213120405955_Proof%20of%20Service.pdf"}]},{"Date":"Feb 13 2024","Text":"Brief amicus curiae of Constitutional Law Scholars filed.","Links":[{"Description":"Main Document","File":"Trump v. US CAC Scholars Brief.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300499/20240213140115053_Trump%20v.%20US%20CAC%20Scholars%20Brief.pdf"},{"Description":"Certificate of Word Count","File":"Trump v. US CAC Cert Compliance.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300499/20240213140124180_Trump%20v.%20US%20CAC%20Cert%20Compliance.pdf"},{"Description":"Proof of Service","File":"Trump v. US CAC Cert of Service.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300499/20240213140133124_Trump%20v.%20US%20CAC%20Cert%20of%20Service.pdf"}]},{"Date":"Feb 14 2024","Text":"Response to application from respondent United States filed.","Links":[{"Description":"Main Document","File":"23A745_Trump v. United States_Gov. 
stay resp_FINAL.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300627/20240214180323991_23A745_Trump%20v.%20United%20States_Gov.%20stay%20resp_FINAL.pdf"},{"Description":"Proof of Service","File":"23A745 - Trump v USA Certificate.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300627/20240214180338307_23A745%20-%20Trump%20v%20USA%20Certificate.pdf"}]},{"Date":"Feb 14 2024","Text":"Brief amicus curiae of Protect Democracy Project filed.","Links":[{"Description":"Main Document","File":"23A745 Trump v. USA_Amicus Brief.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300593/20240214142544324_23A745%20Trump%20v.%20USA_Amicus%20Brief.pdf"},{"Description":"Proof of Service","File":"23A745 Trump v. USA_Proof of Service.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300593/20240214142520222_23A745%20Trump%20v.%20USA_Proof%20of%20Service.pdf"}]},{"Date":"Feb 15 2024","Text":"Reply of applicant Donald J. Trump filed.","Links":[{"Description":"Reply","File":"2024-02-15 - 23A745 - Reply iso Application to S. Ct. for Stay of D.C. Circuit Mandate.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300749/20240215174027604_2024-02-15%20-%2023A745%20-%20Reply%20iso%20Application%20to%20S.%20Ct.%20for%20Stay%20of%20D.C.%20Circuit%20Mandate.pdf"},{"Description":"Proof of Service","File":"2024-02-15 - Certificate of Service for Reply iso Stay Application.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300749/20240215174038799_2024-02-15%20-%20Certificate%20of%20Service%20for%20Reply%20iso%20Stay%20Application.pdf"}]},{"Date":"Feb 15 2024","Text":"Brief amicus curiae of David Boyle filed.","Links":[{"Description":"Main Document","File":"23A745_tsac_DavidBoyle.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300655/20240215114857700_23A745_tsac_DavidBoyle.pdf"}]},{"Date":"Feb 16 2024","Text":"Brief amicus curiae of Alabama and 21 Other States filed.","Links":[{"Description":"Main Document","File":"States Brief in Trump v US FINAL 2.16.24.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300793/20240216132806756_States%20Brief%20in%20Trump%20v%20US%20FINAL%202.16.24.pdf"},{"Description":"Proof of Service","File":"Certificate of Service for States Br. 
FINAL.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300793/20240216132818877_Certificate%20of%20Service%20for%20States%20Br.%20FINAL.pdf"}]},{"Date":"Feb 19 2024","Text":"Brief amicus curiae of Christian Family Coalition filed.","Links":[{"Description":"Main Document","File":"23A745 Amicus CFC.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300875/20240219151217308_23A745%20Amicus%20CFC.pdf"},{"Description":"Certificate of Word Count","File":"CERTIFICATE OF COMPLIANCE.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300875/20240219151223585_CERTIFICATE%20OF%20COMPLIANCE.pdf"},{"Description":"Proof of Service","File":"CERTIFICATE OF SERVICE.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300875/20240219151229160_CERTIFICATE%20OF%20SERVICE.pdf"}]},{"Date":"Feb 19 2024","Text":"Brief amicus curiae of Jeremy Bates filed.","Links":[{"Description":"Main Document","File":"amicus brief oppn to stay 2 19 2024.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300877/20240219152614536_amicus%20brief%20oppn%20to%20stay%202%2019%202024.pdf"},{"Description":"Proof of Service","File":"COS amicus brief oppn to stay 2 19 2024 .pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300877/20240219152631553_COS%20amicus%20brief%20oppn%20to%20stay%202%2019%202024%20.pdf"}]},{"Date":"Feb 20 2024","Text":"Brief amicus curiae of Former Attorney General Edwin Meese III, Law Professors Steven Calabresi and Gary Lawson, and Citizens United filed.","Links":[{"Description":"Main Document","File":"Trump v US Stay Amicus Final.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300972/20240220173530766_Trump%20v%20US%20Stay%20Amicus%20Final.pdf"},{"Description":"Certificate of Word Count","File":"Certificate Word Count.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300972/20240220173536501_Certificate%20Word%20Count.pdf"},{"Description":"Proof of Service","File":"Proof of Service.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300972/20240220173542733_Proof%20of%20Service.pdf"}]},{"Date":"Feb 28 2024","Text":"Application (23A745) referred to the Court."},{"Date":"Feb 28 2024","Text":"Petition GRANTED."},{"Date":"Feb 28 2024","Text":"The application for a stay presented to The Chief Justice is referred by him to the Court. The Special Counsel’s request to treat the stay application as a petition for a writ of certiorari s granted (23-939), and that petition is granted limited to the following question: Whether and if so to what extent does a former President enjoy presidential immunity from criminal prosecution for conduct alleged to involve official acts during his tenure in office. Without expressing a view on the merits, this Court directs the Court of Appeals to continue withholding issuance of the mandate until the sending down of the judgment of this Court. The application for stay is dismissed as moot. \r\n The case will be set for oral argument during the week of April 22, 2024. Petitioner’s brief on the merits, and any amicus curiae briefs in support or in support of neither party, are to be filed on or before Tuesday, March 19, 2024. Respondent’s brief on the merits, and any amicus curiae briefs in support, are to be filed on or before April 8, 2024. 
The reply brief, if any, is to be filed on or before 5 p.m., Monday, April 15, 2024."},{"Date":"Feb 29 2024","Text":"Record requested from the United States Court of Appeals for the District of Columbia Circuit."},{"Date":"Mar 01 2024","Text":"Record received electronically from the United States Court of Appeals for the District of Columbia Circuit and available with the Clerk."},{"Date":"Mar 01 2024","Text":"Record received from the United States District Court for the District of Columbia. The record is electronic and is available on PACER."}],"AttorneyHeaderPetitioner":"Attorneys for Petitioner","Petitioner":[{"Attorney":"D. John Sauer","IsCounselofRecord":true,"Title":"James Otis Law Group, LLC","PrisonerId":null,"Phone":"314-562-0031","Address":"13321 North Outer Forty Road\r\nSuite 300","City":"St. Louis","State":"MO","Zip":"63017","Email":"[email protected]","PartyName":"Donald J. Trump"},{"Attorney":"D. John Sauer","IsCounselofRecord":true,"Title":"James Otis Law Group, LLC","PrisonerId":null,"Phone":"314-562-0031","Address":"13321 North Outer Forty Road\r\nSuite 300","City":"St. Louis","State":"MO","Zip":"63017","Email":"[email protected]","PartyName":"President Donald J. Trump"}],"AttorneyHeaderRespondent":"Attorneys for Respondent","Respondent":[{"Attorney":"Michael R. Dreeben","IsCounselofRecord":true,"Title":"Counselor to the Special Counsel","PrisonerId":null,"Phone":"202-305-9654","Address":"Department of Justice\r\n950 Pennsylvania Ave, NW","City":"Washington","State":"DC","Zip":"20530","Email":"[email protected]","PartyName":"United States"}]}
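A quick sketch of walking that structure, using only field names visible in the example above (whether every docket carries these exact fields is untested):

```python
import requests

# Flatten the docket JSON into entries and document links.
docket = requests.get(
    "https://www.supremecourt.gov/RSS/Cases/JSON/23-939.json"
).json()

for entry in docket.get("ProceedingsandOrder", []):
    print(entry["Date"], "-", entry["Text"])
    for link in entry.get("Links", []):  # attached PDFs, when present
        print("    ", link["Description"], link["DocumentUrl"])
```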
Docket numbering
All dockets use a consistent numbering format: `\d\d[-AMO]\d{1,5}`. I have found four types of dockets, based on their identifier symbol:
- Petition (regular) dockets: `-`
- Application dockets: `A`
- Motion dockets: `M`
- Original(?) dockets: `O`
In the JSON presentations of dockets, the 'sJsonCaseNumber' field is zero-padded to the full five digits; e.g. for docket 23-2 that field contains '00002'. I believe this is the only use of zero-padding in the docket numbers found on the SCOTUS site, probably to avoid lexical-sorting problems in SCOTUS's docket database.
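A small sketch tying the pattern and the padding together; the helper is hypothetical, and I haven't verified that A/M/O dockets use the same JSON URL layout:

```python
import re

DOCKET_RE = re.compile(r"^(\d\d)([-AMO])(\d{1,5})$")

def json_case_fields(docket):
    """Derive the zero-padded case number and JSON URL for a docket number."""
    m = DOCKET_RE.match(docket)
    if m is None:
        raise ValueError(f"not a SCOTUS docket number: {docket!r}")
    term, kind, seq = m.groups()
    return {
        "term": term,
        "type_symbol": kind,
        "sJsonCaseNumber": seq.zfill(5),  # '2' -> '00002'
        "json_url": f"https://www.supremecourt.gov/RSS/Cases/JSON/{docket}.json",
    }

print(json_case_fields("23-2"))
```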
Petition (regular) dockets
Petitions for writ of certiorari are given these docket numbers, e.g. 23-175.
The SCOTUS Public Information Office offers some guidance here:
> All cases receive a docket number upon filing in the Clerk's Office, ranging from 3 to 7 digits (e.g., 21–1, 21–2000).
> The term In Forma Pauperis (IFP) describes permission given to an indigent to proceed without liability for Court fees or costs. "Pauper" cases are always given up to a 7–digit number with the last digits up to the 10,000 or 11,000 series (e.g., 21–5661, 21-10269).
- IFP cases seem to be numbered consistently starting at `<YY>-5001`.
- Docket numbering between 2000 and 5000 is not consistent; I don't have a sense of any pattern here.
- Dockets 5001 and up appear most likely to have few docket entries, and seem most likely to be summarily denied writs of certiorari.
Application dockets
Applications to extend the time to file petitions for certiorari; or for stays pending disposition of petitions for certiorari; possibly other types of actions that I haven't found yet.
When successful, these cases become petition dockets with regular docket numbers. However, there does not appear to be any link from the `A` docket to the `-` docket, only in the other direction. Thus petition docket 23-232 has a reference back to the application 23A2 that preceded it:

```json
{..., "Links": "Linked with 23A2", ...}
```

but there is no reference in 23A2 to 23-232.
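If we ever need to follow those references, a sketch of pulling docket numbers out of the top-level free-text 'Links' field (hypothetical helper, reusing the numbering regex above):

```python
import re

def linked_dockets(docket_json):
    """Extract docket numbers from the free-text top-level 'Links' field."""
    return re.findall(r"\d\d[-AMO]\d{1,5}", docket_json.get("Links") or "")

print(linked_dockets({"Links": "Linked with 23A2"}))  # ['23A2']
```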
Motion dockets
Movants often ask for leave to proceed with their applications as veterans. I'm not clear on what this status accords movants, but it seems to offer some distinction from Applications.
"Orig." dockets
I have found one case, 22O141 (Texas v. New Mexico and Colorado), that is listed in the Granted/Noted List as "Orig. 141". Maybe there are more? In any case, I've included this identifier in the regex patterns.
Progress update
Constrained by lack of a central index
The lack of a single source of truth about SCOTUS dockets looks like it will be the major determinant of how scraping can proceed.
Unfortunately, the simplest solution has been intentionally placed out of reach. SCOTUS's own docket activity email notification system is protected with the ominous-sounding BotDetectCaptcha scripts.
That leaves us needing two dimensions of information: where to look for docket information, and when/how often to look there.
When to look
If handling state will be up to the scrapers' caller, as @mlissner suggested above, I would like to leave this part for later, or for others.
Where to look
So far I have tried three approaches:
- Brute-force queries of docket number ranges.
- Extracting docket numbers from the Granted/Noted List and from Orders of the Court, both of which come as PDFs.
- The Docket Search page, as pointed out by @grossir. It returns URLs to docket HTML pages; I then regex match the docket numbers in the URLs and scrape the corresponding JSON dockets.
My sense is that some combination of Docket Search query results and brute force will be closest to optimal in terms of server resource use, completeness, and timeliness.
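For illustration, the URL-matching step of the third approach; the URL shape is taken from examples in this thread, and the input list is a stand-in for real search results:

```python
import re

DOCKET_URL_RE = re.compile(
    r"/docket/docketfiles/html/public/(\d\d[-AMO]\d{1,5})\.html",
    re.IGNORECASE,  # the site mixes cases, e.g. /DocketFiles/html/Public/
)

search_result_urls = [
    "https://www.supremecourt.gov/search.aspx"
    "?filename=/docket/docketfiles/html/public/23-175.html",
]
dockets = {
    m.group(1)
    for url in search_result_urls
    for m in DOCKET_URL_RE.finditer(url)
}
print(dockets)  # {'23-175'}
```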
Docket Search
Because the date strings for docket entries in the JSON presentation use a consistent format (strftime `"%b %d, %Y"`), search results on those date strings are generally fairly accurate (true positives / total).
Click for search accuracy table
| | date_string | spurious | all | accuracy |
|---|---|---|---|---|
| 0 | 2024-03-01 | 130 | 165 | 0.212121 |
| 1 | 2024-02-29 | 38 | 147 | 0.741497 |
| 2 | 2024-02-28 | 28 | 127 | 0.779528 |
| 3 | 2024-02-27 | 26 | 96 | 0.729167 |
| 4 | 2024-02-26 | 31 | 203 | 0.847291 |
| 5 | 2024-02-25 | 2 | 5 | 0.6 |
| 6 | 2024-02-24 | 0 | 6 | 1 |
| 7 | 2024-02-23 | 30 | 200 | 0.85 |
| 8 | 2024-02-22 | 15 | 141 | 0.893617 |
| 9 | 2024-02-21 | 16 | 130 | 0.876923 |
| 10 | 2024-02-20 | 15 | 450 | 0.966667 |
| 11 | 2024-02-19 | 6 | 19 | 0.684211 |
| 12 | 2024-02-18 | 2 | 4 | 0.5 |
| 13 | 2024-02-17 | 3 | 6 | 0.5 |
| 14 | 2024-02-16 | 33 | 486 | 0.932099 |
| 15 | 2024-02-15 | 18 | 171 | 0.894737 |
| 16 | 2024-02-14 | 13 | 112 | 0.883929 |
| 17 | 2024-02-13 | 12 | 84 | 0.857143 |
| 18 | 2024-02-12 | 22 | 124 | 0.822581 |
| 19 | 2024-02-11 | 8 | 12 | 0.333333 |
| 20 | 2024-02-10 | 2 | 10 | 0.8 |
| 21 | 2024-02-09 | 16 | 130 | 0.876923 |
| 22 | 2024-02-08 | 12 | 173 | 0.930636 |
| 23 | 2024-02-07 | 9 | 135 | 0.933333 |
| 24 | 2024-02-06 | 11 | 83 | 0.86747 |
| 25 | 2024-02-05 | 12 | 100 | 0.88 |
| 26 | 2024-02-04 | 2 | 3 | 0.333333 |
| 27 | 2024-02-03 | 2 | 7 | 0.714286 |
| 28 | 2024-02-02 | 10 | 115 | 0.913043 |
| 29 | 2024-02-01 | 9 | 176 | 0.948864 |
Here I classified as spurious any docket in the search results whose 'Last-Modified' header or whose last docket entry date was earlier than the date given by the search string.
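The classification rule, as a sketch (the helper name is mine; the `%b %d %Y` entry-date format matches the JSON example above but is otherwise an assumption):

```python
from datetime import date, datetime
from email.utils import parsedate_to_datetime

import requests

def is_spurious(json_url, search_date):
    """True when the Last-Modified header or the latest docket entry
    predates search_date, i.e. the hit shows no activity on that date."""
    resp = requests.get(json_url)
    modified = parsedate_to_datetime(resp.headers["Last-Modified"]).date()
    entry_dates = [
        datetime.strptime(e["Date"], "%b %d %Y").date()
        for e in resp.json().get("ProceedingsandOrder", [])
    ]
    last_entry = max(entry_dates, default=date.min)
    return modified < search_date or last_entry < search_date

# is_spurious("https://www.supremecourt.gov/RSS/Cases/JSON/23-402.json",
#             date(2024, 2, 16))
```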
Other benefits from using Docket Search:
- Dockets are updated intraday, and I believe (more testing needed) that these updates are captured by searches in real time.
- The docket numbers used in the URLs returned by the Search don't have text artifacts like PDFs do. I believe that regex matching on those URLs is faithfully capturing all docket numbers returned by each search.
PDF extraction
What @mlissner said. I used PyMuPDF and it seems to give decent results, but the text artifacts it returns (e.g. atypical Unicode dashes U+2010,...,U+2014) mean a fair bit of trial and error on regex patterns just for docket numbers.
Combined with the idiosyncrasies of publication times for the various documents, I think sources like the Orders pages (as @flooie pointed out) can be useful for backscraping and making sure that higher-profile dockets have been updated. But they don't update frequently enough to be useful for real-time scraping.
Brute force
Assume -- in the absence of testing -- that Docket Search results are precise and false negatives (dockets updated but not reflected in the search results) are low. That would leave brute force searches largely for discovery of new dockets, particularly the A/M/O types.
The good support for 'If-Modified-Since' request headers on <supremecourt.gov> is encouraging. I've found that the 304 status code behavior (not re-downloading unchanged assets) speeds up both Docket Searches and downloads of known-good docket numbers. More of a "light touch" than brute force.
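The conditional-fetch pattern looks roughly like this; the 304 support reflects my testing, and the helper itself is illustrative:

```python
import requests

def fetch_if_newer(session, url, since):
    """Return the response body, or None if the server answers 304."""
    resp = session.get(url, headers={"If-Modified-Since": since})
    if resp.status_code == 304:
        return None  # unchanged since `since`; skip downloading and parsing
    resp.raise_for_status()
    return resp.content

body = fetch_if_newer(
    requests.Session(),
    "https://www.supremecourt.gov/RSS/Cases/JSON/23-175.json",
    "Mon, 04 Mar 2024 00:00:00 GMT",  # HTTP-date (RFC 7231) format
)
```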
Upstream questions
I have the core of @mlissner's first action item,
> Juriscraper to scrape content and make it digestible
and I'm working on making the code more robust to network and parsing errors.
One area where I need your input is with the database model(s) that @mlissner mentioned. I haven't tried to write a full docket parser since I don't know what the caller's data requirements will be. But I will need to turn to that.
Another question I have is about object APIs. Should I assume that the SCOTUS docket parser will be called from the command line as `python -m`, such that object interfaces matter less than what's in `argparse` and `main()`? Or should I be trying to make objects conform to existing API patterns, e.g. `AppellateDocketReport`?
Thanks for all this work and detail!
A couple thoughts:
- I think we should just assume we can sign up for email updates. I just wrote a note to SCOTUS asking for help doing this. If they help us, great. If not, let's bust captchas and sign up for email alerts.
  There are APIs for busting captchas now, and I think we should put them on the table if we need them. If SCOTUS doesn't reply to my message, I think an analysis of available captcha-busting APIs would be really useful as a first step here (maybe in a separate issue?).
- If we can get email updates, does that help us learn about new cases, or does it only help with cases we already know exist?
- I love that you're optimizing things by using their search tool, but I feel like brute force is a more reliable method, no? Could we imagine an algorithm that explores the docket number space to identify which dockets exist? Then, once we know a docket exists, we could subscribe to it for email updates?
- If we do the email update route, the architecture changes a bit. We'll want:
  - An email address that can be used for this. (Maybe we use https://recap.email for this, and set up [email protected]?)
  - When that email comes into AWS, it'll send an HTTP POST to our API, so we'll need an API endpoint to handle that. We might be able to use the API endpoint we currently use for recap.email, but it's probably better to set up a different one.
  - Then, once the API is hit, we will need a parser for the email that can at least get the docket number, so we can respond by scraping the website for the latest info. We could also try to scrape other details from the email, but that usually gets painful fast.
Does that sound right?
Email notifications
> There are APIs for busting captchas now, and I think we should put them on the table, if we need them.
I would not want to open that can of worms myself. Leaving aside the poetic irony of doing that to SCOTUS, you would want/need an obfuscation layer for IP addresses, user-agent strings, etc.; and if their admins decide that '[email protected]' suddenly appearing on 40K-ish dockets is not consistent with human effort, you could be back to square one. That could devolve into Mutually Assured Whac-A-Mole.
Other approaches, including what I've described, are going to be sub-optimal but they can be complete. I can continue to work on those but I'm not going to mess with captcha defeats.
Docket search
> I love that you're optimizing things by using their search tool, but I feel like brute force is a more reliable method, no?
Agreed. But after looking into it, I think @grossir was right to point out the search interface as a viable tool. In a nutshell: the search interface for fast but possibly incomplete results, combined with brute-force sequential downloads for slow but complete results.
As mentioned I think issues of search timing aren't blockers for scraping at this point. But if we take email notifications as the gold standard, there's already a modest lag between content arriving at supremecourt.gov and the notifications going out. See for example this notification I received:
"Amicus brief of United States..."
```
Return-path: <0100018e0bde5707-5b7c53ec-c4e8-42eb-85db-609620574298-000000@amazonses.com>
Received: from [IPA] (helo=mailfront20.[server])
	by delivery05.[server] with esmtp (Exim 4.86_2)
	id 1rhI2u-0006v5-EI
	for [me]@[server].com; [timestamp]
Received: from exim by mailfront20.[server] with sa-scanned (Exim 4.93)
	id 1rhI2t-008ah8-RM
	for [me]@[server].com; [timestamp]
[...snip...] envelope-from=0100018e0bde5707-5b7c53ec-c4e8-42eb-85db-609620574298-000000@amazonses.com; helo=a65-150.smtp-out.amazonses.com
Received: from a65-150.smtp-out.amazonses.com
	by mailfront20.[server] with esmtps (TLS1.2:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_128_CBC__SHA1:128)
	(Exim 4.93)
	id 1rhI2k-008ae1-5C
	for [me]@[server].com; [timestamp]
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple;
	s=sglce367a3eekz5cgo2jpvwdiom4ooya; d=sc-us.gov; t=1709596104;
	h=From:To:Subject:MIME-Version:Content-Type:Content-Transfer-Encoding:Message-ID:Date;
	bh=28APxLDdZGKYN8cytr+CtYFVI2hMZwtFrwWODzxt9Ug=;
	b=XKzxz0kZ9d6WlgHjyJLkc7DBW9V5daLi8yTCiItcWh28DrC1ywbbTAmWpLqQ56pV
	8EOOk2itFhQujXkdijeMql/eM3GUQRSjnDJtfhKluWmoW2xupKRRbYq7/mw7otNSHg0
	IknCxfbKCXI8ENOchtz7DiDnana1QvPbligBLP24=
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple;
	s=224i4yxa5dv7c2xz3womw6peuasteono; d=amazonses.com; t=1709596104;
	h=From:To:Subject:MIME-Version:Content-Type:Content-Transfer-Encoding:Message-ID:Date:Feedback-ID;
	bh=28APxLDdZGKYN8cytr+CtYFVI2hMZwtFrwWODzxt9Ug=;
	b=SJWIgiOt/V8jsRjxpv50sS9YGlkCZTLjUlkAZxnb0S+mi9XmvFDjSl72VTekjf45
	pcc8t/PCC27u847OPbrF/EmNlkQysSxbMTjOl8q/s5DJF9Jslhm80PrOHGR/uv4AYd6
	oV8S2BGXitN/SrkbmdpTopTq1GDicECgd8pyqyfY=
From: [email protected]
To: [me]@[server].com
Subject: Supreme Court Electronic Filing System
MIME-Version: 1.0
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit
Message-ID: <0100018e0bde5707-5b7c53ec-c4e8-42eb-85db-609620574298-000000@email.amazonses.com>
Date: Mon, 4 Mar 2024 23:48:24 +0000
Feedback-ID: 1.us-east-1.+5GeZMB3eXeyv3WY8brP46tghxJpXFIF9yDDvLuTQrk=:AmazonSES
X-SES-Outgoing: 2024.03.04-54.240.65.150

A new docket entry, "Amicus brief of United States submitted." has been added for <a href='https://www.supremecourt.gov/search.aspx?filename=/docket/DocketFiles/html/Public/23-175.html'>City of Grants Pass, Oregon, Petitioner v. Gloria Johnson, et al., on Behalf of Themselves and All Others Similarly Situated</a>. You have been signed up to receive email notifications for No. 23-175. <br><br> If you no longer wish to receive email notifications on this case, please
<a href='https://file.supremecourt.gov/Request/NotificationOptOutGet?id=CfDJ8LWjh78o-U5EigyPTWy9BmcYx3Mv5n9kVSbUZvEjTlHeuXmoUJm4CgkH4w-SoXRjva81LlpVk5vw_tiAJt5Gcd2uexHzkPCAYwm-ByGsJxc4lBW3vbKTJljYr1BXC-6TnboxUcviCs05dpHQOBBmkT0'>click here</a>.
```
Look at the notification lag by comparing the email timestamp header (`Date: Mon, 4 Mar 2024 23:48:24 +0000`) with the JSON docket header (`last-modified: Mon, 04 Mar 2024 23:42:46 GMT`), and lastly with the timestamp portion of the docket filing URL http://www.supremecourt.gov/DocketPDF/23/23-175/302264/20240304183726571_23-175npUnitedStates.pdf (i.e. 2024-03-04T18:37:26, which should be EST, so 23:37 GMT). That's around five minutes between the docket item hitting the website and the notification email arriving.
Compare that with scraping results from the docket search interface. I just ran a test using the string 'Feb 3, 2024', including sending the 'If-Modified-Since' filter for 2024-03-04T00:00:00, and I was able to download the 174 valid dockets (i.e. dockets actually updated on/after 2024-03-04) out of 198 search results in one minute and 37 seconds. Multi-threaded or asyncio performance would obviously be even better, although I believe `httpx` integration is a separate issue in your project. Running this job every, say, five minutes wouldn't be a huge tradeoff to avoid messing with captchas, I hope.
> Could we imagine an algorithm that explores the docket number space to identify which dockets exist? Then, once we know a docket exists, we could subscribe to it for email updates?
This is what the petition (i.e. regular) docket number search space looks like after a few days' worth of brute-force attempts (in the attached chart, the X axis is SCOTUS term and the Y axis is the sequential integer portion of the docket number).
There is just a single Y axis in the graph; what appears to be the bottom series is actually the "pauper" case numbers, which seem to begin strictly at 5000, as described in my earlier note. For years 2018+ the numbering is not perfectly contiguous, but close.
Unfortunately, that 'close' is too sparse for e.g. bisection or ternary searches; I tried. Instead for docket number discovery I have been doing sequential searches on the 1-2500 and 5000+ ranges, excluding known good docket numbers, and limiting the number of "Not Found" page results before the search quits. Slow, but effective. And embarrassingly parallel. It just requires known good docket numbers as state.
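For illustration, the discovery loop I've been running, reduced to a sketch. The known_good set is caller-supplied state (kept outside Juriscraper, per the discussion above), the miss limit is arbitrary, and it again assumes missing dockets return a non-2xx status:

```python
import time

import requests

JSON_URL = "https://www.supremecourt.gov/RSS/Cases/JSON/{docket}.json"

def discover(term, start, stop, known_good, max_misses=50):
    """Walk a docket-number range, skipping known numbers, until the
    trailing run of 'not found' results suggests the range is exhausted."""
    session = requests.Session()
    misses = 0
    for num in range(start, stop):
        docket = f"{term}-{num}"
        if docket in known_good:
            misses = 0  # a known docket proves we're still in a live range
            continue
        if session.get(JSON_URL.format(docket=docket)).ok:
            misses = 0
            yield docket
        else:
            misses += 1
            if misses >= max_misses:
                break  # likely past the last assigned number
        time.sleep(1)  # stay gentle
```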
This is great, thank you. A few questions come to mind if we don't want to do the captcha thing and the court doesn't get back to us (I just emailed again).
- How often would it be practical to update the dockets if we have to crawl them all and we do it in parallel?
- Am I gathering that doing the brute force approach without doing search is viable? It seems more reliable to me.
- Is there a way to know that we can stop searching a particular case? Something to indicate that it's completed?
Sorry for the naive questions. I'm really leaning on your help, but I really appreciate the research you're doing!
I mean - we could just manually subscribe to each one - or all the important cases? It's not ideal, but considering the source, it would be appropriate, I think.
You mean with a human instead of a captcha buster?
I am
I think I'd rather automate it, even if that involves busting captchas or scanning their website every few minutes for updates. Let's see what @ralexx thinks about the scanning idea (they said it was "embarrassingly parallel," which is promising), but my general thought is we shouldn't set ourselves up to have to do things as humans, because that scales poorly and we're bad at it.