juriscraper
325 attachment pages
PR #325 Adds Appellate Attachment Page parsing.
This was in a previous PR, but as is my unfortunate Git life, I screwed up a rebase.
This PR was essentially finished IMHO, but I made a final doctoring tweak, updated the typing, and fixed a Python 3 decode/encode issue left over from before the py3 upgrade.
@mlissner I think we had so many ready to bump over the line and this one seems somewhat important.
Tests still pass and it's been rebased. Looking forward to your thoughts on this one.
@albertisfu can you take a look at this and see what it takes to get it over the line, assuming we didn't build appellate attachment page parsing somewhere else, since this PR was forgotten?
Yes, of course, I'll check this. I looked, and we don't have appellate attachment page parsing built anywhere else.
I've checked this and completed the requested changes above.
I added some new tests from other districts. The parser just needed some tweaks to work on ca5 attachment pages, since they contain an extra column with checkboxes.
I also did some tests using this parser to merge appellate attachment pages in CL, in order to confirm the JSON structure is correct. Some questions came up:
- Since appellate attachment pages don't have document numbers, I think the right way to assign the main document number to each RECAP document attachment is to get it from the main RECAP document, bearing in mind that some courts use numbers and others use the `pacer_doc_id` as numbers. Does that seem good to you?
- In order to identify the main RECAP document when merging attachments, we use the main document's `pacer_doc_id`. In district attachment pages, the `pacer_doc_id` can be found outside the attachment table. For appellate attachments, the main document is included within the same attachment table, and sometimes it's not in the first row. For now we identify the main `pacer_doc_id` by assuming that it's the number with the lowest value. I checked some appellate dockets and their attachment pages, and this seems to be true. Here's an example:

  In this example, the docket entry has the `pacer_doc_id`: 009030936940

  In the attachment page, attachment 2 is the one with the lowest `pacer_doc_id`, 009130936940, so this would be the main document.

  So the logic of selecting the main `pacer_doc_id` based on the lowest value works in this example and in a couple more that I checked. Do you know if there might be an exception where the main `pacer_doc_id` is not the one with the lowest value?
- Merging this example in a test in CL, it looks like this:

  I think we need to make some changes to how the main document and attachments are shown in CL for appellate, since now the `Main Document` is a copy of `Attachment 2`.

  So, following the assumption that the main document is the one with the lowest `pacer_doc_id` in appellate attachment pages, should this example be shown like this?

  Using `Attachment 2` as the `Main document` and renumbering the attachments?
- Removed the underscored test examples. They were appellate dockets, which I assume were used just during the development process.

Let me know what you think.
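The lowest-`pacer_doc_id` heuristic described in the list above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the function names and the dict layout are hypothetical, and every id except 009130936940 and 009030936940 is made up. It also assumes that the attachment-page id and the docket-entry id match once the fourth digit is set to 0, which is what the two example ids suggest.

```python
from typing import Dict, List


def normalize_doc_id(pacer_doc_id: str) -> str:
    """Set the fourth digit to 0 so two ids can be compared.

    Assumption: the fourth digit is the only difference between the id on
    the docket entry and the id on the attachment page.
    """
    return f"{pacer_doc_id[:3]}0{pacer_doc_id[4:]}"


def find_main_attachment(attachments: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the attachment row with the lowest numeric pacer_doc_id."""
    return min(attachments, key=lambda row: int(row["pacer_doc_id"]))


# Hypothetical attachment rows; the main document is not in the first row.
attachments = [
    {"attachment_number": "1", "pacer_doc_id": "009130936941"},
    {"attachment_number": "2", "pacer_doc_id": "009130936940"},
    {"attachment_number": "3", "pacer_doc_id": "009130936942"},
]

main = find_main_attachment(attachments)
docket_entry_doc_id = "009030936940"
is_match = normalize_doc_id(main["pacer_doc_id"]) == docket_entry_doc_id
```

If a court ever assigns the main document a higher id than one of its attachments, this heuristic picks the wrong row, which is why the question about exceptions matters.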
Thanks for the research and clean up.
I think the correct way to handle this is by mimicking the appellate website as closely as possible.
If appellate RECAP never has "main" documents, then CourtListener shouldn't have main documents either and should show it like this:
| 10 | Really long description of the docket entry here, blah, blah. blah, blahblah, blahblah, blahblah, blahblah, blahblah, blahblah, blahblah, blah |
|---|---|
| 1 Docketing letter | |
| 2 Mediation letter | |
| 3 Case Opening Packet | |
And in that case, there's no main document for the row. This aligns with how the appellate website shows it too, and if you look at the top of your screenshot of the attachment page, it says, "3 documents are attached to this filing."
I would absolutely not re-number things. Whatever numbers are in appellate PACER are the numbers we must use.
You also said:
> Since appellate attachment pages don't have document numbers,
But then I got lost, because they...do seem to have numbers? Or is there an example of a court that doesn't do them on the attachment pages that you can share? Maybe as a link so I can look too?
I'll take a look at the code next...
> And in that case, there's no main document for the row. This aligns with how the appellate website shows it too, and if you look at the top of your screenshot of the attachment page, it says, "3 documents are attached to this filing."
Perfect, yeah, this seems like the right approach. So when the appellate docket entry has multiple documents (attachments), we won't show the "Main document" row and will only show the attachment rows, as on PACER.
So in this scenario, when the appellate docket is uploaded and the docket entries are added, a "main" `RECAPDocument` is created for each docket entry, containing the `pacer_doc_id` for the "main" document (which we'll use when an attachment page is uploaded to identify the docket entry and merge attachments). So if the appellate docket entry has attachments, we'll only hide the "Main document" row in the frontend, but an empty `RECAPDocument` for that "main document" always exists in the DB.
> But then I got lost, because they...do seem to have numbers? Or is there an example of a court that doesn't do them on the attachment pages that you can share? Maybe as a link so I can look too?
Yes, appellate attachment pages have numbers, but they are the `Attachment number`. I was referring to the `Document number` that every `RECAPDocument` has, which for an attachment is the Document number from the main RECAPDocument/Entry Number.
So this Document number is the one that I think can be copied from the main RECAPDocument, or should it be set as blank?
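To make the proposal concrete, here is a minimal sketch of the behaviour discussed above: attachments copy the Document number from the "main" record, keep their PACER attachment numbers unchanged, and the "main" row is hidden whenever attachments exist. The `DocRow` class and both functions are hypothetical stand-ins, not CourtListener's actual `RECAPDocument` model or merge code, and the sample values are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DocRow:
    """Hypothetical stand-in for a RECAPDocument row."""

    document_number: str
    attachment_number: Optional[int]  # None marks the "main" record
    pacer_doc_id: str
    description: str = ""


def merge_attachments(main: DocRow, parsed: List[Dict[str, str]]) -> List[DocRow]:
    """Build attachment rows that inherit the main record's document number.

    Attachment numbers are taken from the parsed page as-is, never renumbered.
    """
    return [
        DocRow(
            document_number=main.document_number,
            attachment_number=int(row["attachment_number"]),
            pacer_doc_id=row["pacer_doc_id"],
            description=row.get("description", ""),
        )
        for row in parsed
    ]


def rows_to_display(main: DocRow, attachment_rows: List[DocRow]) -> List[DocRow]:
    """Hide the main row when the entry has attachments, as on appellate PACER."""
    if attachment_rows:
        return sorted(attachment_rows, key=lambda r: r.attachment_number or 0)
    return [main]


# Illustrative data only; the ids below are made up.
main = DocRow(document_number="10", attachment_number=None, pacer_doc_id="00901")
parsed = [
    {"attachment_number": "2", "pacer_doc_id": "00912", "description": "Mediation letter"},
    {"attachment_number": "1", "pacer_doc_id": "00911", "description": "Docketing letter"},
    {"attachment_number": "3", "pacer_doc_id": "00913", "description": "Case Opening Packet"},
]
shown = rows_to_display(main, merge_attachments(main, parsed))
```

The main record still exists in this sketch even when it is not displayed, matching the idea of keeping an empty "main" `RECAPDocument` in the DB for merging.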
Thanks for your comments!
I've submitted a PR in CL where this new parser is used: https://github.com/freelawproject/courtlistener/pull/2413. No more changes were needed here, so once this is merged, I can add a version bump and update the CL PR.