
Import Troller BK Data

Open mlissner opened this issue 2 years ago • 1 comments

We have about 2TB of PACER RSS feeds that a friendly supporter gave us. These go back many years and we should import as much of it as we can.

It's all stored in our private S3 bucket in the sources/troller-files directory.

I think importing this is pretty simple. We have code for parsing PACER RSS feeds that works really well, so we just need to alter that code so:

  1. It pulls from S3. Maybe an easy way to get a list of all the files in S3 for this instance is via an S3 inventory file.

  2. It does not alter docket information that's currently in our system. Since this data is old, we need to be careful that it does not trample the data we currently have (which might be newer; yes, PACER data is sometimes updated in place).

  3. We don't import any RSS feeds from after we started doing it ourselves. There's no point importing a feed we already imported. We began monitoring and importing RSS info on 2018-04-19, according to our git logs (a minimal sketch of the cutoff check is below).
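
A minimal sketch of that cutoff check, assuming we compare against each feed's own date:

```python
from datetime import date, datetime

# We began scraping RSS ourselves on 2018-04-19 (per our git logs), so any
# troller feed from that date onward should already be in our system.
RSS_CUTOFF = date(2018, 4, 19)


def should_import(feed_date: datetime) -> bool:
    """Return True if a troller feed predates our own RSS scraping."""
    return feed_date.date() < RSS_CUTOFF
```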

Finally:

  1. It'd also be nice to keep track of how many entries this adds, but it's not super important. Still, could be useful to brag about.

There are three performance concerns to think about when importing this:

  1. It'll get run in cl-celery, which doesn't scale out yet without manual intervention (though we can certainly take that step)
  2. It'll add stuff to our DB, which does not scale without manual intervention and added cost.
  3. It'll add things to Solr, which does not scale without manual intervention and added cost.

I think the best speed to run this at will be as fast as it can go without messing up the DB and Solr.

mlissner avatar May 31 '22 23:05 mlissner

Another note: We need to import these in chronological order somehow. The filenames might provide a clue, I forget.

mlissner avatar Jun 01 '22 00:06 mlissner

Alberto, when you're ready, let's discuss this one.

mlissner avatar Jan 14 '23 01:01 mlissner

Well, this seems to be the next issue on my backlog, so I'm posting some questions that came to mind so we can start discussing it:

It pulls from S3. Maybe an easy way to get a list of all the files in S3 for this instance is via an S3 inventory file.

Got it. Once we have the list of all the files, I'm thinking about which approach would be better for controlling which files get processed and for feeding the celery task. One approach would be a text file with an ordered list of files to be processed: we could open the file, pull some rows, and send them to the celery queue. Or would it be possible to create a temporary table in the DB for this purpose?
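
A rough sketch of the text-file idea (the merge_rss_file task and the batch size are hypothetical placeholders):

```python
from itertools import islice

from celery import shared_task


@shared_task
def merge_rss_file(s3_key: str) -> None:
    """Hypothetical task: download, parse, and merge one troller RSS file."""
    ...


BATCH_SIZE = 100  # hypothetical; tune against queue depth and DB load


def enqueue_manifest(manifest_path: str, offset: int = 0) -> None:
    """Read the ordered manifest and queue S3 keys in small batches."""
    with open(manifest_path) as f:
        keys = (line.strip() for line in islice(f, offset, None))
        while True:
            batch = list(islice(keys, BATCH_SIZE))
            if not batch:
                break
            for s3_key in batch:
                merge_rss_file.delay(s3_key)
```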

It does not alter docket information that's currently in our system. Since this data is old, we need to be careful that it does not trample the data we currently have (which might be newer; yes, PACER data is sometimes updated in place).

So we should add a condition that checks whether a docket entry already exists, and if so, skip that RSS data, right? If the docket already exists but we don't have the docket entry, we should add it, correct?

I saw that in S3 there are RSS files from appellate courts. Checking the scrape_rss command, it seems we currently only scrape district and bankruptcy courts. Does that mean we should add all the data that we don't already have from appellate courts, regardless of the RSS date?

Another note: We need to import these in chronological order somehow. The filenames might provide a clue, I forget.

Well, it seems the filenames don't provide a clue about chronological order, but I think it'll be easy enough to write a script that helps us get the right import order for the files.

  1. It'll get run in cl-celery, which doesn't scale out yet without manual intervention (though we can certainly take that step)
  2. It'll add stuff to our DB, which does not scale without manual intervention and added cost.
  3. It'll add things to Solr, which does not scale without manual intervention and added cost.

I think the best speed to run this at will be as fast as it can go without messing up the DB and Solr.

To control the rate at which we process the data, I think it would be useful to use the @throttle_task decorator, right? We might just need an idea of what the ideal rate would be.
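
For comparison, Celery's built-in rate_limit option is one way to cap throughput (the rate below is just a placeholder; whether it's a better fit than @throttle_task here is exactly the open question):

```python
from celery import shared_task


# Same hypothetical task as above, now with Celery's per-worker rate limit.
@shared_task(rate_limit="10/m")  # placeholder: 10 files per minute, per worker
def merge_rss_file(s3_key: str) -> None:
    """Hypothetical task: download, parse, and merge one troller RSS file."""
    ...
```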

albertisfu avatar Jan 31 '23 01:01 albertisfu

I'm thinking about which approach might be better [...] one approach [is] a text file with an ordered list of files to be processed

Yes, I think that's best. Here are the four inventory files that show what's in the bucket:

1c1c2d14-0055-43dd-8d22-bae1efc1e0e0.csv.gz
742d0200-5b70-4c94-8c7c-9640ca4a2f22.csv.gz
80a98c7f-c116-44b5-940d-9dda598ab292.csv.gz
260a86a9-6760-4599-a5c2-014edb04be56.csv.gz
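
A sketch of turning those inventories into the ordered manifest, assuming the common (bucket, key, size, last_modified_date) inventory layout with no header row, the last-modified timestamp as a rough proxy for feed date, and placeholder bucket/key names:

```python
import csv
import gzip
import io

import boto3

BUCKET = "our-private-bucket"  # placeholder
INVENTORY_KEYS = [  # placeholder keys; the real ones live under an inventory prefix
    "1c1c2d14-0055-43dd-8d22-bae1efc1e0e0.csv.gz",
    "742d0200-5b70-4c94-8c7c-9640ca4a2f22.csv.gz",
    "80a98c7f-c116-44b5-940d-9dda598ab292.csv.gz",
    "260a86a9-6760-4599-a5c2-014edb04be56.csv.gz",
]


def build_ordered_manifest(out_path: str) -> None:
    """Combine the inventory CSVs and write keys ordered by last-modified date."""
    s3 = boto3.client("s3")
    rows = []
    for inv_key in INVENTORY_KEYS:
        body = s3.get_object(Bucket=BUCKET, Key=inv_key)["Body"].read()
        with gzip.open(io.BytesIO(body), mode="rt", newline="") as f:
            for bucket, key, size, last_modified in csv.reader(f):
                if key.startswith("sources/troller-files/"):
                    rows.append((last_modified, key))
    rows.sort()  # ISO-8601 timestamps sort chronologically as strings
    with open(out_path, "w") as out:
        out.writelines(f"{key}\n" for _, key in rows)
```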

So we should add a condition that checks whether a docket entry already exists, and if so, skip that RSS data, right?

Right.

If the docket already exists but we don't have the docket entry, we should add it, correct?

Correct, but I think we should have all of these dockets. Most will just be stubs, but we should have them.
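
Roughly, I'm picturing the merge guard like this (illustrative only; the real models in cl.search.models have more fields and constraints than shown):

```python
from cl.search.models import Docket, DocketEntry


def merge_rss_entry(court_id: str, pacer_case_id: str, case_name: str,
                    entry_number: int) -> bool:
    """Create a stub docket if needed; never touch entries we already have.

    Returns True if a new entry was added.
    """
    docket, _ = Docket.objects.get_or_create(
        court_id=court_id,
        pacer_case_id=pacer_case_id,
        defaults={"case_name": case_name},  # stub docket with minimal data
    )
    # Old troller data must never clobber newer data, so an existing entry
    # short-circuits the merge.
    if DocketEntry.objects.filter(
        docket=docket, entry_number=entry_number
    ).exists():
        return False
    DocketEntry.objects.create(docket=docket, entry_number=entry_number)
    return True
```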

Does that mean we should add all the data that we don't already have from appellate courts, regardless of the RSS date?

Well, in theory, yes, but we don't have parsers for appellate RSS since we don't usually scrape it. I guess we could add appellate RSS scraper support to Juriscraper and then get back to this. It's probably worth doing, unfortunately, and so I've opened a couple issues to do that:

https://github.com/freelawproject/courtlistener/issues/2507
https://github.com/freelawproject/juriscraper/issues/646

Sorry, but it's the right approach so we can import the troller data as well as possible.

I think it'll be easy enough to write a script that helps us get the right import order for the files.

Yeah, we may have to do that. It's worth double checking the filenames in the inventory files.

To control the rate at which we process the data, I think it would be useful to use the @throttle_task decorator, right? We might just need an idea of what the ideal rate would be.

Let's start without celery and just see how slow that is. I kind of prefer not overloading our DB and letting it run for a week or a month if that works for everybody.

mlissner avatar Jan 31 '23 22:01 mlissner

Thank you! So yes, I agree it's better to start by adding support for appellate RSS, so we can be sure we can import troller RSS correctly. I'll start checking those issues then.

albertisfu avatar Feb 01 '23 00:02 albertisfu

Measuring how long each step of importing and merging a troller RSS file takes, I got the following results:

This is for a file of around 800 KB containing approx. 810 entries, merged into an empty DB:

  • Download from S3 time: 1.3318195343017578 seconds

  • Read file time: 0.9539427757263184 seconds

  • Parse file time: 2.273982048034668 seconds

  • Merge entries into DB time: 25.893524646759033 seconds

  • Add elements to Solr: 23.709644317626953 seconds

Going into detail about the parsing time in Juriscraper using the same file:

  • Actual parsing time: 0.30570292472839355 seconds
  • Generate data dict: 2.520118236541748 seconds

The processes that take the longest are merging the docket entries and adding them to Solr. I will check whether we can optimize anything in these steps before proposing horizontal scaling, which would mean using Celery to merge multiple files simultaneously; however, that approach could also have a significant impact on database load.

I believe there is room for improvement in the file download process. Currently, it takes roughly 1.3 seconds per file. Storing the files in a local volume may help, but I am not sure whether that is possible given our architecture.

Parsing the RSS file using feedparser is not taking as much time as we thought. However, generating the data dictionary is taking more time. It seems that parsing the datetime using dateparser is the main issue. I will investigate this further to improve its performance.
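
One idea I want to test: if the pubDate strings in these feeds are all standard RFC 2822 (an assumption I still need to verify), the stdlib parser is far cheaper than dateparser, which could be kept as a fallback:

```python
from email.utils import parsedate_to_datetime

import dateparser  # the current, flexible (and slower) parser


def parse_pub_date(value: str):
    """Parse an RSS pubDate cheaply, falling back to dateparser for oddballs."""
    try:
        # Handles RFC 2822 dates like "Wed, 30 May 2018 15:04:05 GMT".
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Assumption: only a small minority of values need the slow path.
        return dateparser.parse(value)
```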

albertisfu avatar Feb 27 '23 17:02 albertisfu

A few thoughts below, but before we look at that, should we

* Download from S3 time:  1.3318195343017578 seconds

We could change this to always stay a few downloads ahead of processing, right? It could just keep a small in-memory dict with the data?

We could also download to a volume, but that seems like a lot of work to get it all downloaded in advance when we can do it simultaneously just as well.
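
Something like this is what I'm picturing (bucket name, prefetch depth, and process_feed are all placeholders):

```python
import queue
import threading

import boto3

BUCKET = "our-private-bucket"  # placeholder
PREFETCH = 3  # placeholder: how many downloads to stay ahead of parsing


def process_feed(key: str, body: bytes) -> None:
    """Placeholder for the existing parse-and-merge step."""
    ...


def prefetch_files(keys, out_q: queue.Queue) -> None:
    """Download feeds in order and park them in a bounded in-memory queue."""
    s3 = boto3.client("s3")
    for key in keys:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        out_q.put((key, body))  # blocks once PREFETCH items are waiting
    out_q.put((None, None))  # sentinel: no more files


def run(keys) -> None:
    q: queue.Queue = queue.Queue(maxsize=PREFETCH)
    threading.Thread(target=prefetch_files, args=(keys, q), daemon=True).start()
    while True:
        key, body = q.get()
        if key is None:
            break
        process_feed(key, body)
```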

* Read file time:  0.9539427757263184 seconds

This seems weirdly long too, right? Reading ~1MB from disk should be very fast. Also, I'm surprised we're reading the file at all. Shouldn't it be in memory from the download?

* Parse file time:  2.273982048034668 seconds

No surprise here.

* Merge entries into DB time: 25.893524646759033 seconds

This isn't great. We could try to optimize this further, but I'm not sure how. DB indexes are one opportunity, maybe via a multi-column index.
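
For example (illustrative only, not the actual model definition), a multi-column index covering the duplicate check would make each lookup a single index probe:

```python
from django.db import models


class DocketEntry(models.Model):
    # Illustrative only; the real model lives in cl.search.models.
    docket = models.ForeignKey("Docket", on_delete=models.CASCADE)
    entry_number = models.BigIntegerField(null=True)

    class Meta:
        indexes = [
            # Covers the (docket, entry_number) lookup used during the merge.
            models.Index(fields=["docket", "entry_number"]),
        ]
```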

* Add elements to Solr: 23.709644317626953 seconds

We can easily make this async, right? That'll help.

Going into detail about the parsing time in Juriscraper using the same file:

* Actual parsing time: 0.30570292472839355 seconds

* Generate data dict: 2.520118236541748 seconds

The processes that take the longest are merging the docket entries and adding them to Solr. I will check whether we can optimize anything in these steps before proposing horizontal scaling, which would mean using Celery to merge multiple files simultaneously; however, that approach could also have a significant impact on database load.

Yeah, doing it via Celery seems like a tricky idea for sure.

Parsing the RSS file using feedparser is not taking as much time as we thought. However, generating the data dictionary is taking more time. It seems that parsing the datetime using dateparser is the main issue. I will investigate this further to improve its performance.

What about the idea of splitting up the file to make it faster, so we only do part of it at a time? Was that too ugly of a hack?

Also, could we do our parsing one item at a time, as a stream? That might be easier than adding streaming support to feedparser.

mlissner avatar Feb 27 '23 18:02 mlissner

Thanks for your comments and suggestions. I've already submitted PR #653 with some Juriscraper performance improvements. I also tested the approach of parsing one item at a time, with no apparent performance improvement (more details in the PR, so you can evaluate whether that's what you meant).

I'll go through your other suggestions and comments on the CL side.

albertisfu avatar Feb 27 '23 22:02 albertisfu

I tested your suggestion of parsing one RSS item at a time instead of parsing the whole file, so that once we hit a cached item we can skip the rest of the file and save time.

These were my findings.

For a file with approx. 1,296 entries:

Parsing the whole file after the Juriscraper optimization took:

  • Parse file time: 0.25797462463378906 seconds
  • Generate file dict data time: 0.4553248882293701 seconds

Total parsing time: 0.718124 seconds

Splitting the RSS file and parsing one item at a time:

  • Parse item time: 0.00365 seconds
  • Generate item dict data time: 0.000792 seconds

Total item parsing time: 0.004442 seconds

[Screenshot from 2023-02-28 comparing per-item vs. whole-file parse times]

So I think this can be a performance improvement when each RSS file has only a few new elements and the rest are cached. After approx. 160 elements, parsing item by item becomes slower than parsing the whole file at once.

We could do something like this: parse one item at a time until we hit a cached element, so we can skip the rest of the file; but if we reach element number "x" (a threshold), parse the whole file instead of continuing one item at a time, so the parsing time doesn't get worse.

So this can help us save a fraction of a second on files where most of the items are already in the cache.
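
A rough sketch of that strategy (the is_cached callable is hypothetical, and the threshold comes from the measurements above):

```python
import re

import feedparser

THRESHOLD = 160  # past this, per-item parsing costs more than one full parse
ITEM_RE = re.compile(rb"<item>.*?</item>", re.DOTALL)
WRAPPER = b"<rss><channel>%s</channel></rss>"


def parse_until_cached(raw_feed: bytes, is_cached) -> list:
    """Parse items one at a time, bailing out to a full parse past THRESHOLD."""
    entries = []
    for i, match in enumerate(ITEM_RE.finditer(raw_feed)):
        if i >= THRESHOLD:
            # Too many new items; one whole-file parse is now cheaper.
            return feedparser.parse(raw_feed).entries
        entry = feedparser.parse(WRAPPER % match.group(0)).entries[0]
        if is_cached(entry.get("id")):  # feedparser exposes <guid> as entry.id
            break  # everything after this was merged on a previous run
        entries.append(entry)
    return entries
```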

  • About having a few downloads ahead of processing: I think this would definitely help save download time. We can download multiple files in parallel, put them in a Python queue, and process files from the queue instead of downloading one file at a time.

  • The other optimization we can do is adding elements to Solr. You mentioned we could do it async; do you mean using threading or something similar, so we don't use celery, right?

Let me know if I should review any or all of these optimization plans further.

albertisfu avatar Feb 28 '23 21:02 albertisfu

Thanks, that makes sense. Before we do the parsing hack (I suspect it'll be ugly), let's do the easy things:

  • Move the Solr indexing to Celery by adding delay to it (this should help a lot, right?). See the sketch after this list.
  • Use a download queue for S3 downloads. That makes sense to me.
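
A sketch of the first bullet (the task here is illustrative; CL's real Solr-indexing task has its own name and signature):

```python
from celery import shared_task


@shared_task(ignore_result=True)
def add_items_to_solr_async(item_pks: list[int]) -> None:
    """Hypothetical wrapper: push merged entries to Solr off the hot path."""
    ...


def after_merge(new_entry_pks: list[int]) -> None:
    # Instead of blocking ~24s per file on Solr, hand the IDs to Celery.
    if new_entry_pks:
        add_items_to_solr_async.delay(new_entry_pks)
```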

Then, if we're still not happy, let's do the hack. :)

mlissner avatar Feb 28 '23 23:02 mlissner

FWIW, we're about 200k files through this (out of 3M), and Solr currently has 5.4M bankruptcy dockets. Most of this data is bankruptcy related, so I expect this number to go way up as we gather it. (This should make #2572 slightly better).

mlissner avatar Mar 06 '23 17:03 mlissner

In case anybody is wondering, yes, this is still going. We're about 1.8M files into the collection, and I check on it pretty much every day. It's chugging along. This is certainly the biggest addition to our database in terms of item count.

mlissner avatar May 10 '23 16:05 mlissner

Slooowly making progress. Currently at:

./manage.py troller_bk --offset 2589103 --file /storage/tmp/mlissner/final-file-cleaned-and-ordered.csv

mlissner avatar Jun 29 '23 13:06 mlissner

This is done at long last. I added a lifecycle rule in S3 to move all of the troller data to S3 Glacier Deep Archive. I should remove that rule in a few days.

mlissner avatar Jul 13 '23 17:07 mlissner