juriscraper 325 attachment pages

PR #325 Adds Appellate Attachment Page parsing.

This was in a previous PR, but as is my unfortunate Git life, I screwed up a rebase.

This PR was essentially finished, IMHO, but I made a final doctoring tweak, updated the typing, and fixed a Python 3 decode/encode issue left over from before the py3 upgrade.

Dec 05 '20 20:12 flooie

All committers have signed the CLA.

Dec 18 '21 00:12 CLAassistant

@mlissner I think we had so many PRs ready to bump over the line, and this one seems somewhat important.

Tests still pass and it's been rebased. Looking forward to your thoughts on this one.

Dec 18 '21 18:12 flooie

@albertisfu can you take a look at this and see what it takes to get it over the line, assuming we didn't build appellate attachment page parsing somewhere else, since this PR was forgotten?

Dec 07 '22 20:12 mlissner

Yes, of course, I'll check this. Yeah, I checked and we don't have appellate attachment page built somewhere else.

Dec 07 '22 20:12 albertisfu

I've checked this and completed the requested changes above.

I added some new tests from other districts. It just needed some tweaks to work with ca5 attachment pages, since they contain an extra column with checkboxes.

I also ran some tests using this parser to merge appellate attachment pages in CL, in order to confirm the JSON structure is correct. Some questions came up:

Since appellate attachment pages don't have document numbers, I think the right way to assign the main document number to each RECAP document attachment is to get it from the main RECAP document, bearing in mind that some courts use numbers and others use the pacer_doc_id as the number. Does that seem good to you?
To identify the main RECAP document when merging attachments, we use the main document's pacer_doc_id. In district attachment pages, that pacer_doc_id can be found outside the attachment table. In appellate attachment pages, the main document is included within the same attachment table, and sometimes it's not in the first row. For now we identify the main pacer_doc_id by assuming that it's the one with the lowest value. I checked some appellate dockets and their attachment pages, and this seems to hold.

Here's an example:

In this example, the docket entry has the pacer_doc_id 009030936940. In the attachment page, attachment 2 is the one with the lowest pacer_doc_id, 009130936940, so this would be the main document.

So the logic of selecting the main pacer_doc_id based on the lowest value works in this example and in a couple more that I checked. Do you know whether there might be an exception where the main pacer_doc_id is not the one with the lowest value?
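
A minimal sketch of that selection rule, assuming each attachment row is a dict with a `pacer_doc_id` key. The function name is hypothetical and not part of the parser, and only attachment 2's ID comes from the example above; the other two IDs are made up for illustration.

```python
from typing import Dict, List


def pick_main_pacer_doc_id(attachments: List[Dict[str, str]]) -> str:
    """Return the lowest pacer_doc_id among the attachment rows.

    Hypothetical helper: it encodes the assumption that the main
    document is the row with the lowest pacer_doc_id.
    """
    if not attachments:
        raise ValueError("attachment page has no rows")
    # Compare numerically so IDs of different lengths still order correctly.
    return min((a["pacer_doc_id"] for a in attachments), key=int)


# Only attachment 2's ID is taken from the example; the IDs for
# attachments 1 and 3 are invented for illustration.
rows = [
    {"attachment_number": "1", "pacer_doc_id": "009130936941"},
    {"attachment_number": "2", "pacer_doc_id": "009130936940"},
    {"attachment_number": "3", "pacer_doc_id": "009130936942"},
]
main_id = pick_main_pacer_doc_id(rows)
```

Note that the row order does not matter here, which is the point: the main document is not always in the first row.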
When I merge this example in a test in CL, it looks like this:

Screen Shot 2022-12-09 at 15 17 47

I think we need to make some changes to how the main document and attachments are shown in CL for appellate, since now the Main Document is a copy of Attachment 2.

So, following the assumption that the main document is the one with the lowest pacer_doc_id in appellate attachment pages, should this example be shown like this?

Screen Shot 2022-12-09 at 15 26 22

Using the Attachment 2 as the Main document and renumbering the attachments?

I removed the underscored test examples. They were appellate dockets, which I assume were used only during the development process.

Let me know what you think.

Dec 09 '22 21:12 albertisfu

Thanks for the research and clean up.

I think the correct way to handle this is by mimicking the appellate website as closely as possible.

If appellate RECAP never has "main" documents, then CourtListener shouldn't have main documents either and should show it like this:

10	Really long description of the docket entry here, blah, blah. blah, blahblah, blahblah, blahblah, blahblah, blahblah, blahblah, blahblah, blah
	1 Docketing letter
	2 Mediation letter
	3 Case Opening Packet

And in that case, there's no main document for the row. This aligns with how the appellate website shows it too, and if you look at the top of your screenshot of the attachment page, it says, "3 documents are attached to this filing."

I would absolutely not re-number things. Whatever numbers are in appellate PACER are the numbers we must use.

You also said:

Since appellate attachment pages don't have document numbers,

But then I got lost, because they...do seem to have numbers? Or is there an example of a court that doesn't do them on the attachment pages that you can share? Maybe as a link so I can look too?

I'll take a look at the code next...

Dec 10 '22 00:12 mlissner

And in that case, there's no main document for the row. This aligns with how the appellate website shows it too, and if you look at the top of your screenshot of the attachment page, it says, "3 documents are attached to this filing."

Perfect, yeah, this seems like the right approach. So yes, when the appellate docket entry has multiple documents (attachments), we won't show the "Main document" row and will only show the attachment rows, as on PACER.

So in this scenario, when the appellate docket is uploaded and the docket entries are added, a "main" RECAPDocument is created for each docket entry, containing the pacer_doc_id for the "main" document (which we'll use when an attachment page is uploaded to identify the docket entry and merge attachments). So if the appellate docket entry has attachments, we'll only hide the "Main document" row in the frontend, but an empty RECAPDocument for that "main document" always exists in the DB.
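
A rough sketch of that display rule, not the actual CL code. The `Doc` dataclass is a hypothetical stand-in for a RECAPDocument, with `attachment_number` left as `None` for the "main" row, and the pacer_doc_id values for attachments 1 and 3 are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a RECAPDocument row."""

    pacer_doc_id: str
    attachment_number: Optional[int] = None  # None marks the "main" document


def rows_to_display(docs: List[Doc]) -> List[Doc]:
    """Hide the main document row when the entry has attachments."""
    attachments = [d for d in docs if d.attachment_number is not None]
    if attachments:
        # The main row stays in the DB; it is only left out of the listing.
        return sorted(attachments, key=lambda d: d.attachment_number)
    return docs


entry = [
    Doc("009030936940"),  # empty "main" document created from the docket
    Doc("009130936941", 1),  # made-up ID
    Doc("009130936940", 2),
    Doc("009130936942", 3),  # made-up ID
]
visible = rows_to_display(entry)
```

An entry with no attachments falls through unchanged, so its single main row is still shown.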

But then I got lost, because they...do seem to have numbers? Or is there an example of a court that doesn't do them on the attachment pages that you can share? Maybe as a link so I can look too?

Yes, appellate attachment pages have numbers, but they are the Attachment number. I was referring to the Document number that every RECAPDocument has, which for an attachment is the Document number from the main RECAPDocument/Entry Number.

Screen Shot 2022-12-09 at 19 20 12

So this Document number is the one that I think can either be copied from the main RECAPDocument or left blank?
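
To make the first option concrete, here is a small sketch of copying the Document number onto each attachment. The function name and dict keys are assumptions for illustration, not the CL merge code; the entry number and descriptions are borrowed from the example listing earlier in this thread.

```python
from typing import Dict, List


def build_attachment_records(
    main_document_number: str, attachments: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Copy the main document's number onto every attachment record.

    Hypothetical helper: each attachment keeps its own attachment
    number and inherits the Document number from the main entry.
    """
    return [
        {**att, "document_number": main_document_number}
        for att in attachments
    ]


attachments = [
    {"attachment_number": "1", "description": "Docketing letter"},
    {"attachment_number": "2", "description": "Mediation letter"},
    {"attachment_number": "3", "description": "Case Opening Packet"},
]
records = build_attachment_records("10", attachments)
```

The second option, leaving it blank, would amount to passing an empty string instead of the entry number.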

Thanks for your comments!

Dec 10 '22 01:12 albertisfu

I've submitted a PR in CL where this new parser is used: https://github.com/freelawproject/courtlistener/pull/2413. No more changes were needed here, so once this is merged, I can add a version bump and update the CL PR.

Dec 13 '22 20:12 albertisfu
