juriscraper
text file case reports lead to re-establishing session
Hi,
I have the following function to download pipe-delimited .txt case filing reports from several bankruptcy courts (I believe case filed report scraping has not been implemented yet, according to #185).
from bs4 import BeautifulSoup

def query_case_report(court_id, pacer_session, range_begin, range_end):
    url = "https://ecf.{}.uscourts.gov/cgi-bin/CaseFiled-Rpt.pl?1-L_1_0-1".format(court_id)
    data = {
        'trustee': 'All',
        'DateType': 'filed',
        'StartDate': range_begin,
        'EndDate': range_end,
        'include_dismissed_cases': '1',
        'case_type': 'bk',
        'open_cases': 'on',
        'closed_cases': 'on',
        'party_information': 'on',
        'sort1': 'filed date',
        'data_format': 'data only',
        'display_dataonly_header': 'on'
    }
    # First POST: returns an intermediate HTML page containing a form.
    intermediate_response = pacer_session.post(url, data)
    intermediate_doc = BeautifulSoup(intermediate_response.content, 'lxml')
    form = intermediate_doc.find('form')
    action = form.attrs.get('action')
    # Keep only the last path segment of the form action.
    action_path = action.split('/')[-1]
    text_url = 'https://ecf.{}.uscourts.gov/cgi-bin/'.format(court_id) + action_path
    # Second POST: returns the pipe-delimited text report.
    response = pacer_session.post(text_url, data)
    return response

# ps is an existing, logged-in PACER session.
query_case_report('caeb', ps, '04/01/2020', '04/15/2020')
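For context, this is roughly how the returned text could be turned into rows once the download works. It is only a sketch: parse_pipe_report is a hypothetical helper, the sample columns are made up, and it assumes the first line of the report is the header row (which is what display_dataonly_header is meant to request).

```python
import csv
import io


def parse_pipe_report(text):
    """Parse pipe-delimited report text into a list of dicts keyed by header.

    Hypothetical helper; assumes the first line holds the column names.
    """
    # QUOTE_NONE treats quote characters in party names as plain text.
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    header = [name.strip() for name in next(reader, [])]
    # Skip blank lines; zip() drops any fields beyond the header length.
    return [dict(zip(header, (value.strip() for value in row))) for row in reader if row]


# Made-up sample data; real PACER column names will differ.
sample = (
    "case_number|date_filed|chapter\n"
    "20-12345|04/01/2020|7\n"
    "20-12346|04/02/2020|13\n"
)
rows = parse_pipe_report(sample)
```

With a real response, the call would be something like parse_pipe_report(response.text).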
The issue is that the second POST request triggers an unnecessary re-login. (Side note: I've tried the suggestions in #185 but wasn't able to return the pipe-delimited text file with just one request. Any suggestions are welcome.) I've narrowed it down to the check_if_logged_in_page(text) function in the juriscraper.pacer.http module, which parses the returned HTML for the relevant text on the page. However, when a report is returned in the pipe-delimited 'data only' format, the response is not an HTML page, so the function defaults to the False case.
I believe the following should fix it. In pacer.utils, add the following function:
def is_text(response):
    """Determines whether the item downloaded is a text file or something else."""
    # Default to "" so a missing content-type header returns False instead of raising.
    if '.txt' in response.headers.get("content-type", ""):
        return True
    return False
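A quick way to sanity-check a helper like this without hitting PACER is to feed it stub responses. In the sketch below, FakeResponse and the header values are made up for illustration (I haven't copied them from a real PACER response), and the "" default in .get() is an assumption so that a missing header returns False rather than raising.

```python
class FakeResponse:
    """Hypothetical stand-in for a requests response; only .headers is used.

    Note: a plain dict is case-sensitive, unlike the headers on a real response.
    """

    def __init__(self, headers):
        self.headers = headers


def is_text(response):
    """Return True when the content-type header mentions '.txt'."""
    return ".txt" in response.headers.get("content-type", "")


# Made-up header values for illustration only.
text_report = FakeResponse({"content-type": 'text/plain; name="report.txt"'})
html_page = FakeResponse({"content-type": "text/html; charset=utf-8"})
no_header = FakeResponse({})

results = [is_text(text_report), is_text(html_page), is_text(no_header)]
```

Only the first stub should be treated as a text file; the other two fall through to False.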
And in pacer.http, add the following lines below the is_pdf clause in the _login_again function:
    if is_text(r):
        return False
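To show the intended order of checks in isolation, here is a self-contained sketch. It is not juriscraper's actual code: needs_relogin, LOGGED_IN_MARKER, the is_pdf stand-in, and the header values are all hypothetical, and the real function does more than this. The point is only that non-HTML downloads return early, before any HTML text matching runs.

```python
from types import SimpleNamespace

# Hypothetical marker; the real check looks for other text on the page.
LOGGED_IN_MARKER = "Logout"


def is_pdf(response):
    # Stand-in for the existing PDF check.
    return response.headers.get("content-type", "") == "application/pdf"


def is_text(response):
    return ".txt" in response.headers.get("content-type", "")


def needs_relogin(response):
    """Hypothetical stand-in for the decision made in _login_again."""
    if is_pdf(response):
        return False  # binary download, no HTML to inspect
    if is_text(response):
        return False  # pipe-delimited report, no HTML to inspect
    # Only HTML pages reach the text-matching step.
    return LOGGED_IN_MARKER not in response.text


# Made-up responses for illustration.
report = SimpleNamespace(
    headers={"content-type": 'text/plain; name="report.txt"'},
    text="case_number|date_filed\n20-12345|04/01/2020\n",
)
login_page = SimpleNamespace(
    headers={"content-type": "text/html"},
    text="<html>Please log in</html>",
)
```

Here needs_relogin(report) is False even though the report text contains no marker, while needs_relogin(login_page) is True.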
I tested these changes in a big scraping loop I'm running, and it seems to be working fine and has stopped the re-logins, but I'm definitely no expert on this so there might be a corner case I've missed/haven't thought about.
Thanks for building this library!
That's a solid improvement, yep. IIRC, there's an is_pdf check in there too, right? Seems like a good pattern to follow, and yes, I'm not at all surprised by this issue.
I can't think of any other areas of the code that'd be affected by a change like this.
Want to do a PR?
Yeah, there is an is_pdf check as well.
Sure, I can do a PR. It'll be my first for a FOSS project. Exciting!
Awesome! Lots of first-timers to Juriscraper. Looking forward to it.