
Automatically download Document Selection Menus

Open Pascal666 opened this issue 6 years ago • 15 comments

Docket entries with attachments listed will have a Document Selection Menu listing all of the attachments with descriptions, like https://ecf.mtd.uscourts.gov/doc1/1110815290 and https://ecf.nysd.uscourts.gov/doc1/127020983456 . These documents are free. It would be nice if CL automatically downloaded these documents and added them to the docket whenever RECAP uploads a docket.
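
For illustration, fetching and parsing one of these menus might look roughly like the sketch below. It assumes an already-authenticated PACER session; the XPath is a guess at the menu's table layout, not CL's actual parsing code.

```python
import requests
from lxml import html

# Hypothetical sketch, not CL's actual code. Assumes `session` already
# carries valid PACER login cookies; the XPath selectors are guesses at
# the Document Selection Menu's HTML layout.

def fetch_attachment_menu(session: requests.Session, doc1_url: str) -> list[list[str]]:
    response = session.get(doc1_url, timeout=30)
    response.raise_for_status()
    tree = html.fromstring(response.text)
    rows = []
    for tr in tree.xpath("//table//tr[td/a]"):  # rows with a document link
        cells = [c.strip() for c in tr.xpath("./td//text()") if c.strip()]
        rows.append(cells)  # e.g. [attachment number, description, page count]
    return rows

# Usage, assuming an authenticated session:
# fetch_attachment_menu(session, "https://ecf.mtd.uscourts.gov/doc1/1110815290")
```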

Pascal666 avatar Jul 11 '18 06:07 Pascal666

Funny, for the first time ever we've got all the tools to do this as of...Sunday. Only catch is that we get a LOT of content each day, so this would queue up a lot of requests to PACER. Still, it's useful info — probably worth doing.

mlissner avatar Jul 11 '18 06:07 mlissner

Devil's advocate: Practically speaking, why is this important? Most of the information there is in the docket report already, is it not? What's the case for this?

johnhawkinson avatar Jul 11 '18 12:07 johnhawkinson

You don't get a ton of info, but you do get:

  • Page counts — Having the page count is pretty useful because you can use it to estimate costs when doing bulk work.

  • File size (except in bankruptcy) — Not that useful, but we're gathering it now. Looks like it can be used to estimate whether something is a scanned doc before downloading it.

  • Document ID — Might help protect against orphan documents some of the time if we proactively have this?

  • Short description — I find this useful sometimes because it can be a more accurate representation of the document. For example, if a docket entry mentions the word "invoice," you won't know whether there's actually an invoice attached to it. If an attachment page mentions "invoice," you can be pretty sure. (This is an actual example from a time when I needed to crawl thousands of attachment pages.)

  • Existence — This might not be the biggest thing, but at a glance I much prefer to see this:

    [Screenshot: CourtListener docket for Kessler v. City of Charlottesville, 3:18-cv-00015, with attachments listed under the entry]

    Than this:

    [Screenshot: the same docket without the attachments listed]

    The former tells me a lot more visually speaking.

I'd be curious to get @Pascal666's perspective here too.

It'd be nice to have a heuristic for when we decide to do this. Say we have a system for gathering attachment pages. When do we decide that we care enough about a docket to begin doing so? (A sketch of one possible policy follows the list.)

  • Every docket all the time?
  • Ones that have been seen more than x times?
  • Big ones?
  • Something else?
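
For example, a policy could look something like this sketch, where the fields and thresholds are placeholders rather than real models or agreed-upon numbers:

```python
from dataclasses import dataclass

# Illustrative policy only; field names and thresholds are assumptions,
# not CL's real models or agreed-upon numbers.

@dataclass
class Docket:
    view_count: int    # how often the docket has been seen ("more than x times")
    entry_count: int   # number of docket entries ("big ones")

VIEW_THRESHOLD = 5     # placeholder for "seen more than x times"
ENTRY_THRESHOLD = 100  # placeholder for "big ones"

def should_crawl_attachment_pages(d: Docket) -> bool:
    """One possible blend of the options above; 'every docket' would just return True."""
    return d.view_count > VIEW_THRESHOLD or d.entry_count > ENTRY_THRESHOLD
```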

mlissner avatar Jul 11 '18 16:07 mlissner

When viewing a docket as in mlissner's second photo, there is no clear way to even get to the Document Selection Menu. I finally figured it out, but before that I was paying for the docket at PACER in order to get a link to the Document Selection Menu. This is the problem that made me file this issue in the first place.

Having CL download them automatically for all dockets would also obviate the need for RECAP to do it so issues like https://github.com/freelawproject/recap/issues/238 could be closed.

Pascal666 avatar Jul 11 '18 21:07 Pascal666

Be careful of cases like document 51 at https://www.courtlistener.com/docket/4302680/deering-v-centurytel-inc/

Once you have the cost of downloading a document, it would be nice to have a page where a user can type in how much money they want to spend on PACER (like the $15 everyone gets free) and receive a list of documents they can download for RECAP that will add up to that total cost.
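
Since PACER charges $0.10/page, capped at $3.00 per document, page counts from attachment pages translate directly into costs, and the selection could be as simple as this sketch (greedy, cheapest-first; the (doc_id, page_count) pairs are hypothetical inputs):

```python
# Sketch of the budget feature. PACER charges $0.10/page, capped at
# $3.00 per document, so page counts map directly to costs.

def pacer_cost(page_count: int) -> float:
    return min(page_count * 0.10, 3.00)

def pick_documents(docs: list[tuple[str, int]], budget: float):
    """Greedy, cheapest-first: fits the most documents under the budget."""
    chosen, spent = [], 0.0
    for doc_id, pages in sorted(docs, key=lambda d: pacer_cost(d[1])):
        cost = pacer_cost(pages)
        if spent + cost <= budget:
            chosen.append(doc_id)
            spent += cost
    return chosen, spent

# pick_documents([("a", 4), ("b", 45), ("c", 120)], budget=15.00)
# -> (['a', 'b', 'c'], 6.4)  # 45 and 120 pages both hit the $3.00 cap
```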

Pascal666 avatar Jul 11 '18 23:07 Pascal666

@mlissner:

I remain skeptical of the merits of having CL autoscrape anything in response to a RECAP upload. Certainly we need to have a privacy conversation about that; it's another example of a way the courts could identify who is using RECAP and who is not, which right now is not something they can do just by looking at traffic to the ECF servers.

I tend to think that if we are going to scrape stuff, scraping iquery.pl for case metadata is more important. Not that they're mutually exclusive in any way.

Document ID — Might help protect against orphan documents some of the time if we proactively have this?

This sounds like a crutch to avoid fixing the real problem. 👎

Existence — This might not be the biggest thing, but at a glance I much prefer to see this:

This is because we don't know how to parse the parenthesized attachment list. I think that would be a higher priority fix (and an easy one!).
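
Roughly, something like this untested sketch; the regexes are guesses at the usual "(Attachments: # 1 ..., # 2 ...)" phrasing and would need checking against real docket text from different courts:

```python
import re

# Rough cut at parsing "(Attachments: # 1 Exhibit A, # 2 Memorandum)"
# out of a docket entry's long description. Court-to-court punctuation
# quirks (and nested parens) would need real-world testing.

ATTACHMENTS = re.compile(r"\(Attachments?:\s*(?P<body>[^)]*)\)")
ITEM = re.compile(r"#\s*(?P<num>\d+)\s+(?P<desc>[^,#]+)")

def parse_attachment_list(entry_text: str) -> list[tuple[int, str]]:
    m = ATTACHMENTS.search(entry_text)
    if not m:
        return []
    return [(int(i["num"]), i["desc"].strip()) for i in ITEM.finditer(m["body"])]

# parse_attachment_list("MOTION to Dismiss (Attachments: # 1 Exhibit A, # 2 Memorandum)")
# -> [(1, 'Exhibit A'), (2, 'Memorandum')]
```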

@Pascal666:

When viewing a docket as in mlissner's second photo, there is no clear way to even get to the Document Selection Menu. I finally figured it out, but before that I was paying for the docket at PACER in order to get a link to the Document Selection Menu. This is the problem that made me file this issue in the first place.

I'm not sure what you're getting at here ("I finally figured it out"). Do you mean "View Document" or constructing the URL by hand or what?

Having CL download them automatically for all dockets would also obviate the need for RECAP to do it so issues like freelawproject/recap#238 could be closed.

It would not. CL is never going to scrape instantaneously so it's always good to parse as much data as we can when we get it.

Be careful of cases like document 51 at https://www.courtlistener.com/docket/4302680/deering-v-centurytel-inc/

What is the specific concern? That attachment pages (that's what we call these) are not available for some documents because of sealing? OK.

johnhawkinson avatar Jul 12 '18 01:07 johnhawkinson

Certainly we need to have a privacy conversation about that, it's another example of ways the courts could identify who is using RECAP and who is not, which right now is not something they can do exclusively by looking at traffic to the ECF servers.

Good point, but I think you vastly overestimate how hard it is to determine if someone is running RECAP. I created http://23.238.17.229/recap.html in about 10 minutes. Probably only works on Chrome.

I'm not sure what you're getting at here ("I finally figured it out"). Do you mean "View Document" or constructing the URL by hand or what?

In mlissner's first screenshot it is obvious how to get the attachments. In the second, you have to click the down arrow to the right of "Download PDF" and select the third option, "Buy on PACER", to get to the free Document Selection Menu.

What is the specific concern? That attachment pages (that's what we call these) are not available for some documents because of sealing? OK.

I probably could have phrased that better; there's no real concern, I was just giving an example of an error page to check for (assuming some type of error handling and retries are implemented).

Pascal666 avatar Jul 12 '18 06:07 Pascal666

Re privacy: This is worth thinking about, but I don't know that it's a big issue. We get content from a lot of places. We could also turn this on for RSS if we wanted to. Who knows why we're scraping a particular attachment page?

Re this helping the orphan problem: I still don't know what the fix for that is so long as we're getting documents before dockets (I expect this will continue forever). So in my mind, anything we can do to address that is a win.

(Summarizing for @Pascal666: We sometimes get documents without knowing which docket they belong to, because we never received that docket's info. When this happens, we call it an orphan document, because it has no parent, and we wait and hope the docket gets uploaded some day.)
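
To make that concrete, here's a toy sketch of how proactively-gathered doc IDs would let us adopt orphans; plain dicts stand in for our actual database models:

```python
# Toy model of the orphan fix: attachment pages teach us which docket a
# pacer_doc_id belongs to, so orphaned PDFs keyed by that ID can be
# adopted later. Dicts stand in for CL's real tables.

orphans = {"127020983456": b"%PDF-..."}       # pacer_doc_id -> stored PDF
doc_id_to_docket = {"127020983456": 4302680}  # learned from attachment pages

def adopt_orphans() -> None:
    for doc_id in list(orphans):
        docket_id = doc_id_to_docket.get(doc_id)
        if docket_id is not None:
            orphans.pop(doc_id)  # in reality: link the document row to the docket
            print(f"Attached document {doc_id} to docket {docket_id}")

adopt_orphans()
```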

This is because we don't know how to parse the parenthesized attachment list. I think that would be a higher priority fix (and an easy one!).

Hm. I've never tried. I guess if it's easy, yeah, we should do that too. I agree that the iquery page is probably higher priority.

It would not. CL is never going to scrape instantaneously so it's always good to parse as much data as we can when we get it.

Agree. We should get that fixed. Should be an easy one for somebody that knows JS better than me.


Responding to @Pascal666's comments...

Good point, but I think you vastly overestimate how hard it is to determine if someone is running RECAP. I created http://23.238.17.229/recap.html in about 10 minutes.

Works in Chrome, not Firefox. Boy, I wonder if there's a way to prevent this and still have web resources in the extension. Seems difficult.

mlissner avatar Jul 12 '18 18:07 mlissner

I created http://23.238.17.229/recap.html in about 10 minutes.

In Chrome with RECAP enabled, it tells me RECAP not detected, so...? What are you attempting?

We get content from a lot of places. We could also turn this on for RSS if we wanted to. Who knows why we're scraping a particular attachment page?

Again, the argument here is traffic analysis. If I download a particular attachment page and it's followed by a CL scrape 5 minutes later, then it's clear I had RECAP enabled. At least if that's the pattern.

johnhawkinson avatar Jul 12 '18 21:07 johnhawkinson

In Chrome with RECAP enabled, it tells me RECAP not detected so...? What are you attmpeting?

You can check the JS for the extension, @johnhawkinson, but I suspect it doesn't work for you because you have a dev version running. I had to install the usual version to make it work.

Again, the argument here is traffic analysis. I download a particular attachment page and it's followed by a CL scrape 5 minutes later, then it's clear I had RECAP enabled. At least if that's the pattern.

I assume you mean that you download a particular docket, but I guess we could build in delays if we really thought this was a risk. That still might not work though if only one person ever downloads the docket from PACER, in which case, they'd be the guilty RECAP user. I'm not really feeling that this is a huge risk though, to be honest.

But, if we want to solve this risk, I guess one solution could be to just download ALL of the attachment pages every day by using the serial number in the doc1 ID. I'd put that at somewhere around 150k PACER requests/day.
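
Something like this sketch, assuming (as that estimate implies) that a doc1 ID is a court-specific prefix plus a sequential serial; that split is inferred from the example IDs earlier in this thread, not a documented PACER format:

```python
# Sketch of brute-force daily enumeration. The split of a doc1 ID into a
# court prefix plus sequential serial is inferred from the example IDs
# above (e.g. 1270 + 20983456 for nysd), not a documented PACER format.

BASE = "https://ecf.nysd.uscourts.gov/doc1/"
PREFIX = "1270"  # assumed court-specific prefix

def daily_urls(last_serial: int, newest_serial: int):
    """Yield every doc1 URL minted since the last crawl."""
    for serial in range(last_serial + 1, newest_serial + 1):
        yield f"{BASE}{PREFIX}{serial:08d}"

# ~150k of these per day across all courts, per the estimate above.
```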

mlissner avatar Jul 12 '18 21:07 mlissner

Re this helping the orphan problem: I still don't know what the fix for that is so long as we're getting documents before dockets (I expect this will continue forever). So in my mind, anything we can do to address that is a win.

Do the documents contain enough information to create a link to the docket on PACER?

You can check the JS for the extension, @johnhawkinson, but I suspect it doesn't work for you because you have a dev version running. I had to install the usual version to make it work.

Indeed. I've documented that page a little better.

Pascal666 avatar Jul 12 '18 22:07 Pascal666

Works in Chrome, not Firefox. Boy, I wonder if there's a way to prevent this and still have web resources in the extension. Seems difficult.

Well, we only use this for images, so we could inline all the images. Annoying but not difficult. And it gets away from your preferred zero-build-system posture.

I'm not really feeling that this is a huge risk though, to be honest.

I'm not prepared, at this point, to make a judgement of the magnitude of the risk and do a cost-benefit tradeoff. But I want to send up the flag that this proposal is introducing a new kind of privacy problem and note that analysis is required.

johnhawkinson avatar Jul 13 '18 04:07 johnhawkinson

I'm not prepared, at this point, to make a judgement of the magnitude of the risk and do a cost-benefit tradeoff.

Totally agreed. To discuss when we are ready to cross this bridge.

Well, we only use this for images, so we could inline all the images

Yep. Joy. New issue filed: https://github.com/freelawproject/recap/issues/253 (we're way off topic)

mlissner avatar Jul 13 '18 05:07 mlissner

Well, I forgot how much conversation this humble issue has generated, but I did think of two more reasons that we should be doing active crawling of this content in some form:

  1. When you look at search results, having the short description is really nice. For example, the second result is much easier to understand because it has a short description:

    [Screenshot from 2018-09-18: search results where the second result includes a short description]

  2. The HTML titles that we're using on Twitter and elsewhere as of #863 show the short description if we have one, and show nothing if we don't.

mlissner avatar Sep 18 '18 20:09 mlissner

Once you have the cost of downloading a document, it would be nice to have a page where a user can type in how much money they want to spend on PACER (like the $15 everyone gets free) and receive a list of documents they can download for RECAP that will add up to that total cost.

Good point: that's probably out of scope for this task, but it would go toward https://github.com/freelawproject/courtlistener/issues/1346 .

I remain skeptical at the merits of having CL autoscrape anything in response to a RECAP upload. Certainly we need to have a privacy conversation about that, it's another example of ways the courts could identify who is using RECAP and who is not

Maybe it's enough to have a semi-random wait before the scrape happens? You don't want to do it immediately anyway, because the user might upload the docket a few seconds or minutes later. If the scrape happens hours or days later it's probably going to be hard to correlate it to anything the user did (as opposed to a "coincidence" such as someone else becoming interested in that case after seeing it in the news or on social media or whatever).
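
Concretely, the jitter could be as simple as this sketch, where the 6-hour-to-3-day window is an arbitrary placeholder and a real deployment would schedule a delayed queue task rather than sleeping:

```python
import random
import time

# Minimal sketch of the semi-random wait before scraping. The window is
# arbitrary; a real system would enqueue a delayed task, not block.

def scrape_with_jitter(scrape_fn, *args, **kwargs):
    delay = random.uniform(6 * 3600, 72 * 3600)  # 6 hours to 3 days
    time.sleep(delay)
    return scrape_fn(*args, **kwargs)
```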

nemobis avatar Mar 24 '22 17:03 nemobis