Harvard Opinions fails to import all volumes

Open flooie opened this issue 3 years ago • 8 comments

Harvard Opinions will stop, without an error, in the middle of a reporter.

I've seen it occur on a number of reporters at this point. For example, when I attempted to import all of sw3d, it stopped at volume 13.
When I ran the command to begin at volume 13, it continued onward.

I don't have an explanation, but two things to note.

  1. We are going to need (and this was already the case) a good system to compare IA and the logs to see how often this is happening.
  2. There may be a bug (or a broken JSON file) that crashes or ends the import without an error, which we should attempt to find.
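
A minimal sketch of the comparison described in point 1, under some assumptions: it checks a local copy of the corpus (rather than IA itself) against the importer's "Processing opinion at" log lines. The function names and the idea of reading the log as one string are hypothetical and not part of the importer.

```python
import re
from pathlib import Path

# Matches the importer's "Processing opinion at <path>" log lines.
PROCESSING_RE = re.compile(r"Processing opinion at (\S+\.json)")


def processed_paths(log_text: str) -> set[str]:
    """Return every JSON path named in a 'Processing opinion at' log line."""
    return set(PROCESSING_RE.findall(log_text))


def missing_from_log(corpus_dir: str, log_text: str) -> list[str]:
    """List JSON files under corpus_dir that never appear in the log."""
    seen = processed_paths(log_text)
    on_disk = sorted(str(p) for p in Path(corpus_dir).rglob("*.json"))
    return [p for p in on_disk if p not in seen]
```

Running `missing_from_log` per reporter would give a rough count of how many files were skipped, assuming the log paths and the on-disk paths are spelled the same way.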

flooie avatar Jul 20 '22 20:07 flooie

It crashed or stopped again ... just for context at

INFO Processing opinion at /storage/harvard_corpus/law.free.cap.sw3d.134/719.9249004.json

Not sure if there is anything specific about this particular opinion yet.

flooie avatar Jul 20 '22 21:07 flooie

And again with INFO Processing opinion at /storage/harvard_corpus/law.free.cap.n-mar-i.1/1.1693654.json

The Northern Mariana Islands

flooie avatar Jul 20 '22 21:07 flooie

Do we have any naked exceptions in there, like:

try:
    do_something()
except:
    do_something_else()

When I've had silent crashes like this, that's sometimes the culprit?
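
For comparison, here is a sketch of a handler that would not hide a failure. A bare `except:` also catches `SystemExit` and `KeyboardInterrupt`, so catching `Exception`, logging it, and re-raising is usually safer. The names `import_one` and `loader` are hypothetical and are not taken from the importer.

```python
import logging

logger = logging.getLogger(__name__)


def import_one(path, loader):
    """Run loader(path); log unexpected errors with a traceback, then re-raise."""
    try:
        return loader(path)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Failed while processing %s", path)
        raise
```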

mlissner avatar Jul 21 '22 06:07 mlissner

After some extensive digging into this matter, I am not convinced the bug is necessarily inside the importer. I walked myself through all the permutations between error handling and successful addition.

My leading theory was that there was a bug in the glob path maker that simply shortened the list of files to iterate over. This would allow it to end early without throwing a Sentry error and continue on its way.

I even came across a bug report about issues with glob that gave me some hope for my theory: https://github.com/python/cpython/issues/83075

But after more analysis I can't run with this theory either.

INFO Adding opinion for: 25 N.J. 54
INFO Finished: 25 N.J. 54
INFO Finished adding case at https://www.courtlistener.com/opinion/7362507/mazzilli-v-accident-casualty-insurance
INFO Processing opinion at /storage/harvard_corpus/law.free.cap.nj.25/55.1314213.json
=== NEW IMPORT ===
INFO Processing opinion at /storage/harvard_corpus/law.free.cap.nj-eq.1/1.12164279.json
=== NEW IMPORT ===
INFO Processing opinion at /storage/harvard_corpus/law.free.cap.njl.1/1.322050.json
=== NEW IMPORT ===
INFO Processing opinion at /storage/harvard_corpus/law.free.cap.nj-manumission.1/7.6675896.json
.... and this continued with the nj-manumission reporter (which I believe is updated but not added to main)

Above is a section of the log from a run over a set of NJ reporters.

=== NEW IMPORT === is not part of the importer; it is logging I add between reporters in my run command.

Once a "Processing opinion" line has been logged, there is no known mechanism for exiting the Harvard importer without logging a WARNING or an INFO message. The calls involved are logger.error, logger.warning, logger.info, and one errant logging.warning.

At this point, I'm not sure what to do. I think this is probably pretty common, but I haven't done the math yet. At a minimum, I would suggest adding a logger.info call at the start of each reporter and each volume, logging the number of volumes to run and the number of opinions to add, respectively. That would at least let me rule out a set of unlikely but nagging ideas (i.e., the glob bug combined with some mystery silent crash).
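
A rough sketch of that suggestion, assuming the directory layout visible in the log lines above (`law.free.cap.<reporter>.<volume>/<file>.json`). The function name `plan_import` is hypothetical, and the real importer may build its file list differently.

```python
import glob
import logging
import os

logger = logging.getLogger(__name__)


def plan_import(corpus_dir, reporter):
    """Map each volume directory of a reporter to its JSON files, logging counts."""
    # glob.escape guards against special characters in the directory names.
    pattern = os.path.join(glob.escape(corpus_dir), f"law.free.cap.{reporter}.*")
    volume_dirs = sorted(glob.glob(pattern))
    logger.info("Reporter %s: %d volumes to run", reporter, len(volume_dirs))

    plan = {}
    for volume_dir in volume_dirs:
        files = sorted(glob.glob(os.path.join(glob.escape(volume_dir), "*.json")))
        logger.info("Volume %s: %d opinions to add", volume_dir, len(files))
        plan[volume_dir] = files
    return plan
```

If the logged counts come up short compared with what is on disk, the file listing is the problem; if the counts are right and the run still ends early, the cause lies elsewhere. Note that the sort here is lexicographic, so volume 10 is listed before volume 2.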

flooie avatar Jul 25 '22 22:07 flooie

Another data point: in an effort to complete a reporter, I was running three sections of the reporter at the same time, each over different volumes.

Two of the three stopped in the same manner as the others and finished as if the remaining volumes didn't exist. I take this new discovery as evidence that something outside the script itself is going on.
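
One way to test the "outside the script" idea is to log when the process is told to stop. This is only a sketch: the function names are hypothetical, and it cannot detect SIGKILL (for example, from the kernel's out-of-memory killer), since that signal cannot be caught.

```python
import atexit
import logging
import signal

logger = logging.getLogger(__name__)


def log_signal_and_exit(signum, frame):
    """Log the received signal, then exit with the conventional 128+N status."""
    logger.error("Received %s; exiting", signal.Signals(signum).name)
    raise SystemExit(128 + signum)


def install_exit_logging():
    """Log on SIGTERM/SIGHUP and on normal interpreter shutdown."""
    for name in ("SIGTERM", "SIGHUP"):
        sig = getattr(signal, name, None)  # SIGHUP does not exist on Windows
        if sig is not None:
            signal.signal(sig, log_signal_and_exit)
    atexit.register(logger.info, "Importer process exiting")
```

If a run ends with neither the signal message nor the exit message in the log, that would point toward the process being killed outright rather than finishing on its own.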

flooie avatar Jul 26 '22 00:07 flooie

Do you have a log line that indicates that the script is finishing successfully (or not)?

mlissner avatar Jul 26 '22 05:07 mlissner

yes. we have a few ending spots, but they are all logged.

flooie avatar Jul 26 '22 06:07 flooie

Well, that's good. Keep digging. :)

mlissner avatar Jul 26 '22 06:07 mlissner