ValueError: The 'filepath_local' attribute has no file associated with it.
Sentry Issue: COURTLISTENER-2V4
ValueError: The 'filepath_local' attribute has no file associated with it.
File "celery/app/autoretry.py", line 35, in run
return task._orig_run(*args, **kwargs)
File "cl/recap/tasks.py", line 295, in process_recap_pdf
file_contents = pq.filepath_local.read()
File "django/core/files/utils.py", line 42, in <lambda>
read = property(lambda self: self.file.read)
File "django/db/models/fields/files.py", line 43, in _get_file
self._require_file()
File "django/db/models/fields/files.py", line 40, in _require_file
raise ValueError("The '%s' attribute has no file associated with it." % self.field.name)
@mlissner I've checked this error in detail and it seems to be a bit more complicated than we thought.
There are a few different potential errors that might need to be fixed, which I'll try to explain in detail:
Something weird is the set of Processing Queues that were requested for the same case:
- pq: 6563553 (PDF) 290
- pq: 6563555 (PDF) 290
- pq: 6563556 (PDF) 288
- pq: 6563562 (Docket 144305)
- pq: 6563563 (Docket 144305)
These PQs are related to the same Docket with pacer_case_id: 144305.
As you can see, there are two PQs for the same document 290.
Also, there are two PQs to process the same HTML docket.
Some questions about it are:
- I understand that these requests are made by the RECAP extension, is that right?
- Is it normal to request first the documents and after the Docket HTML?
- Is it normal for requests to be duplicated for the same document/docket?
The second problem I found is that the PDF PQs failed when being processed because they don't contain a pacer_case_id, so it wasn't possible to link them to the Docket (pacer_case_id: 144305). The error they returned was: "Too many dockets found when trying to save."
That error is raised in process_recap_pdf, here:
d = Docket.objects.get(
pacer_case_id=pq.pacer_case_id, court_id=pq.court_id
)
I suspect that there is more than one Docket for the following query:
d = Docket.objects.get(
pacer_case_id="", court_id="nyed"
)
- Would it be possible to confirm it through the Django shell?
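In case it's useful, this is roughly the shell check I have in mind (just a sketch; it only counts how many dockets would match the lookup the PDF tasks end up running):
from cl.search.models import Docket

# How many dockets match the lookup process_recap_pdf runs when the PQ
# carries no pacer_case_id? Anything above 1 would explain the
# "Too many dockets found when trying to save." error.
Docket.objects.filter(pacer_case_id="", court_id="nyed").count()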
So:
- pq: 6563553 (PDF) 290 failed due to “Too many dockets found when trying to save.”
- pq: 6563555 (PDF) 290 failed due to “Too many dockets found when trying to save.”
- pq: 6563556 (PDF) 288 failed due to “Too many dockets found when trying to save.”
Then the following process_recap_docket tasks started executing:
- pq: 6563562 (Docket 144305) created
- pq: 6563563 (Docket 144305) created
It seems that the process_recap_docket task for pq 6563563 started first:
It got docket entries for 290 (duplicated) and 288.
Then it ran process_orphan_documents, but it failed for 290 due to the duplicated RECAP documents.
But it succeeded for 288, so that document was properly linked to the docket (pacer_case_id: 144305). Then the filepath_local of pq 6563556 was deleted and the PQ was marked as processed successfully. The document is available here.
I think that at almost the same time the process_recap_docket task for pq 6563562 was executed, so this task also called process_orphan_documents; this is the possible race condition. It failed again for 290 due to the duplicated RECAP documents.
When it tried to link 288, it raised the error “The 'filepath_local' attribute has no file associated with it.” because its filepath_local had been deleted by the previous task (6563563).
Because pq 6563556 failed in this second task, it remained with the status “Item is currently being processed” even though it had succeeded previously.
Here is a diagram with more details about this possible race condition for a better understanding.
I think that, from the CL side, a solution to avoid re-processing a document that was previously added in process_recap_pdf is to first check whether the RD is already available, and also to get the RD using select_for_update inside an atomic transaction, so that if two tasks execute at almost the same time, one of them has to wait and the race condition is avoided.
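Something along these lines is what I have in mind (only a sketch of the idea, not the final code; the pdf_already_attached helper is hypothetical, the RECAPDocument fields used here are assumptions based on the discussion above, and the actual linking of the PDF would need to happen inside the same transaction for the lock to be useful):
from django.db import transaction
from cl.search.models import RECAPDocument

def pdf_already_attached(pq) -> bool:
    # Hypothetical helper for process_recap_pdf: returns True when another
    # task has already attached this PDF, so the current task can skip it.
    with transaction.atomic():
        # select_for_update() locks the matching rows until the transaction
        # ends, so a concurrent task blocks here instead of racing us.
        rd = (
            RECAPDocument.objects.select_for_update()
            .filter(pacer_doc_id=pq.pacer_doc_id)
            .first()
        )
        return rd is not None and rd.is_available and bool(rd.filepath_local)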
Also, I'd like to know whether duplicate PQs are a normal thing or something we should review and solve on the RECAP extension side.
Let me know your opinion about this, thanks!
I understand that these requests are made by the RECAP extension, is that right?
Yes, if there's a PQ, it's usually coming from RECAP, but sometimes I'll generate PQ's from a script because it's an easy way to get something into the processing pipeline. That hasn't happened recently though.
Is it normal to request first the documents and after the Docket HTML?
It's not normal, exactly, but it definitely happens and the system is supposed to support it. This can happen, for example, when I share a link to a PDF on PACER with you. You might click the link, buy the doc, and then upload it to us via RECAP. But we might not have the docket yet! When this happens, we keep the PDF laying around, and then if the docket ever comes, we automatically link up the PDFs we have for it. This is a neat trick b/c sometimes somebody uploads a docket and suddenly a bunch of docs are available.
Is it normal for requests to be duplicated for the same document/docket?
Sure. Say two people are buying a doc at the same time or around the same time. In that case, they'll both upload it. Sometimes, people just want to buy the doc, even though we have it — who knows why. In that case, we get the second doc some time after the first. Pretty normal stuff for us.
Also, there are two PQs to process the same HTML docket.
When it comes to dockets, people can buy only part of it to save money. The whole docket will cost $3, but if you just want stuff since the last time you checked it, you might buy only the last few updates for the cost of 10¢, say. So it's very normal to have lots of docket updates for a single case.
This query:
d = Docket.objects.get(
    pacer_case_id="", court_id="nyed"
)
will bring back thousands or tens of thousands of cases, yes. So many, in fact, that our poor server should probably be spared from running the query, because it's going to be a pretty slow one.
I think that, from the CL side, a solution to....
This is really great analysis, wow. Yes, those solutions look great, and by the time I finished reading, I was already reaching for select_for_update.
One question for you. We have two terrible issues we can't resolve:
- Somehow dockets sometimes get intermixed, with entries from one docket showing up on the other docket. Could that be related?
- Sometimes we wind up with duplicate dockets for the same case. I think this is a variation of the first issue, and what happens is that the second docket gets merged into an existing one, overwriting its case name and turning it into a duplicate. Could that be related?
Thanks for the detailed analysis.
Great, thank you for your answers! Now I think I have enough info that confirms the issue so I'll work to apply the solution to avoid the race condition that is happening here.
About your questions
Somehow dockets sometimes get intermixed, with entries from one docket showing up on the other docket. Could that be related?
Sometimes we wind up with duplicate dockets for the same case. I think this is a variation of the first issue, and what happens is that the second docket gets merged into an existing one, overwriting its case name and turning it into a duplicate. Could that be related?
It would be great if you could share some links to dockets with these issues: duplicated dockets for the same case, and intermixed docket entries. (Is this issue different from the one reported by email, where the docket is different from the docket alert?)
That way I could check them in detail and see whether these are related issues or something else we need to solve.
It would be great if you could share some links to dockets with these issues: duplicated dockets for the same case, and intermixed docket entries. (Is this issue different from the one reported by email, where the docket is different from the docket alert?)
Yes, the email I sent is representative of the issue. I'll send you more examples as people report them.
Perfect, well I'll start checking those other issues with the data we have now and see how they are related.
I've done some tests on this issue. I managed to reproduce a race condition that duplicates docket entries and RECAP documents, so I could assess some possible solutions.
I think the problem is on this line:
de, de_created = DocketEntry.objects.get_or_create(
docket=d, entry_number=docket_entry["document_number"]
)
So I tested using select_for_update() and @transaction.atomic:
de, de_created = DocketEntry.objects.select_for_update().get_or_create(
docket=d, entry_number=docket_entry["document_number"]
)
However, it didn't work in this case (it works on MySQL but not on Postgres). It seems that select_for_update() only locks DB rows when retrieving existing elements, as in a filter() or get(), not when creating new elements, so both tasks find no row to lock and fall through to the INSERT.
I also tested creating a DB constraint for the DocketEntry model: unique_together = ('docket', 'entry_number',)
This avoids duplicated docket entries on the same case, and it worked: no race condition and no duplicated entries.
But I'm not sure whether it's possible to add this DB constraint without affecting how docket entries work on PACER.
I saw that there are some docket entries without an entry_number. In my tests, it was possible to add multiple docket entries without an entry_number (NULL in the DB).
Another thing to consider is that it would be necessary to delete the duplicated docket entries of existing cases before applying the constraint migration; otherwise, it will fail.
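For reference, this is roughly the shape the constraint could take (a sketch only; the field definitions and the constraint name are placeholders, not the real CL model). Django's get_or_create() copes with concurrent creates by catching the IntegrityError and retrying the lookup, but that only works when the lookup fields are backed by a unique constraint in the DB, which is why the constraint fixes the race. And since entry_number can be NULL, a conditional constraint would keep allowing multiple entries without a number:
from django.db import models

class DocketEntry(models.Model):
    docket = models.ForeignKey("Docket", on_delete=models.CASCADE)
    entry_number = models.BigIntegerField(null=True, blank=True)
    # ...other fields omitted...

    class Meta:
        constraints = [
            # Enforce uniqueness only when entry_number is present, so
            # entries stored with a NULL entry_number can still repeat.
            models.UniqueConstraint(
                fields=["docket", "entry_number"],
                condition=models.Q(entry_number__isnull=False),
                name="unique_docket_entry_number",
            ),
        ]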
Please let me know what you think.
I also checked the issue of the Docket alert being linked to a different case; it seems that it's not related to this race condition. I'll open a separate issue about it with my findings.
Great. Can you give me a query to check how easy the constraint will be to add? I.e., how many rows would need deletion/alteration/etc.?
Of course, I checked this:
Adding the constraint to the DB will scan the whole table. From the Postgres docs for ADD table_constraint:
"Normally, this form will cause a scan of the table to verify that all existing rows in the table satisfy the new constraint."
So to count the total number of records in the search_docketentry table, it's possible to use:
SELECT count(*) AS exact_count FROM public.search_docketentry;
This one is slower because it retrieves the exact count.
It's also possible to get an estimate with:
SELECT reltuples AS estimate FROM pg_class where relname = 'search_docketentry';
Also, to add the constraint it will be necessary to delete the duplicated records (more than one row with the same entry_number for the same docket_id). With the following query you can retrieve the records that meet this condition:
SELECT docket_id, entry_number, count(*) FROM public.search_docketentry GROUP BY docket_id, entry_number having count(*) > 1;
It will return the list of duplicated records with the row count for each (docket_id, entry_number) pair:
 docket_id | entry_number | count
-----------+--------------+-------
         7 |          155 |     3
       104 |            3 |     2
I had to tweak the query:
SELECT docket_id, entry_number, count(*)
FROM public.search_docketentry
WHERE entry_number is not null
GROUP BY docket_id, entry_number
HAVING count(*) > 1;
When I did that, it timed out, so I added a limit of 100:
docket_id | entry_number | count
-----------+--------------+-------
2372 | 803694284 | 2
2372 | 803694297 | 2
2372 | 803694306 | 2
2372 | 803694592 | 2
9851 | 65 | 2
9851 | 66 | 2
9851 | 67 | 2
9851 | 68 | 2
9851 | 69 | 2
9851 | 70 | 2
9851 | 73 | 2
11607 | 803360597 | 2
17216 | 107491229 | 2
17216 | 107492369 | 2
17901 | 4 | 2
17901 | 7 | 2
17901 | 8 | 2
17901 | 9 | 2
28397 | 12 | 2
28397 | 13 | 2
42748 | 8 | 2
42748 | 10 | 2
42748 | 11 | 2
42748 | 12 | 2
42748 | 14 | 2
242473 | 3744 | 2
242473 | 3745 | 2
242473 | 3746 | 2
242473 | 3747 | 2
242473 | 3748 | 2
242473 | 3749 | 2
242473 | 3750 | 2
242473 | 3751 | 2
242473 | 3752 | 2
242473 | 3753 | 2
242473 | 3754 | 2
242473 | 3755 | 2
242473 | 3756 | 2
242473 | 3757 | 2
242473 | 3758 | 2
242473 | 3759 | 2
242473 | 3760 | 2
242473 | 3761 | 2
242473 | 3762 | 2
242473 | 3763 | 2
242473 | 3764 | 2
242473 | 3765 | 2
242473 | 3766 | 2
242473 | 3767 | 2
242473 | 3768 | 2
242473 | 3769 | 2
242473 | 3770 | 2
242473 | 3771 | 2
242473 | 3772 | 2
242473 | 3773 | 2
242473 | 3774 | 2
242473 | 3775 | 2
242473 | 3776 | 2
242473 | 3777 | 2
242473 | 3778 | 2
242473 | 3779 | 2
242473 | 3780 | 2
242473 | 3781 | 2
242473 | 3782 | 2
242473 | 3783 | 2
242473 | 3784 | 2
242473 | 3785 | 2
242473 | 3789 | 2
242473 | 3792 | 2
242473 | 3793 | 2
242473 | 3794 | 2
242473 | 3795 | 2
242473 | 3796 | 2
242473 | 3797 | 2
242473 | 3798 | 2
242473 | 3799 | 2
242473 | 3801 | 2
242473 | 3802 | 2
242473 | 3803 | 2
242473 | 3854 | 2
242473 | 3855 | 2
242473 | 3856 | 2
242473 | 3858 | 2
242473 | 3859 | 2
242473 | 3860 | 2
242473 | 3861 | 2
242473 | 3862 | 2
242473 | 3863 | 2
242473 | 3864 | 2
242473 | 3865 | 2
242473 | 3866 | 2
242473 | 3868 | 2
242473 | 3869 | 2
242473 | 3870 | 2
242473 | 3871 | 2
242473 | 3872 | 2
242473 | 3873 | 2
242473 | 3874 | 2
242473 | 3875 | 2
242473 | 3876 | 2
I'm not sure what this teaches us...I looked at a few. They seem fairly old and like real duplicates, I think. We might be able to do a mass deletion to fix this, but it'd be pretty scary. I wasn't sure how to get a count of the duplicate rows either, and maybe getting a random sample of this would be good?
Thanks! I looked at some of these results and I found some that might indicate that sometimes these are not duplicates:
For instance, the ones related to this case: https://www.courtlistener.com/docket/42748/toru-paul-may/
docket_id | entry_number | count
-----------+--------------+-------
42748 | 8 | 2
42748 | 10 | 2
42748 | 11 | 2
42748 | 12 | 2
42748 | 14 | 2
So yes, I think it might be pretty risky to mass-delete duplicates.
However, it seems that these kinds of duplicates were added on different dates, so I was thinking about an alternative to avoid the race condition when multiple docket entries with the same entry number are added at (almost) the same time.
This alternative is to add a Redis semaphore, similar to the one used to avoid race conditions when sending Docket Alerts (create_redis_semaphore), so that only one task can process the document upload for a given pacer_doc_id at a time.
I read that this semaphore still has a very small race condition window, so I just need to confirm that it could help here.
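To be concrete, the shape I have in mind is roughly the following (a generic sketch with redis-py rather than the existing create_redis_semaphore helper; the key prefix and TTL are placeholders):
import redis

r = redis.Redis()

def acquire_upload_lock(pacer_doc_id: str, ttl: int = 60) -> bool:
    # SET with nx=True is atomic: only one task gets to create the key, and
    # the TTL guarantees a crashed task can't hold the lock forever.
    return bool(r.set(f"recap.upload:{pacer_doc_id}", "1", nx=True, ex=ttl))

def release_upload_lock(pacer_doc_id: str) -> None:
    r.delete(f"recap.upload:{pacer_doc_id}")

A task that fails to acquire the lock could simply retry itself a little later (e.g. via Celery's retry), which keeps the uploads for the same pacer_doc_id serialized.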
Does it sound good to you?
Yeah, if we can't do it using postgresql, then a redis lock makes sense. I'm not sure if the current locking mechanism is perfect, but it'd probably fix 99% of these issues. There are some blog posts too that might be interesting (though they look complicated).
Does it look like this might be related: https://sentry.io/organizations/freelawproject/issues/3477741404/
Yeah, if we can't do it using postgresql, then a redis lock makes sense
I took a look at #2144 and I got hopeful that maybe it's worth one more look to make sure that we can't accomplish this using DB locking (it's just so much more robust).
One thing that looked sort of, maybe, promising is using select_related with select_for_update. Did you try that too? https://docs.djangoproject.com/en/dev/ref/models/querysets/#select-for-update
Of course. I only checked select_for_update independently (that didn't work), but I haven't checked it in combination with select_related. I'll give it a look and see if it works.
Thanks for the suggestion.
You're ready for this to be closed, @albertisfu?
Yeah! This is ok to be closed!
One down!
Sentry issue: COURTLISTENER-6NS