text file case reports lead to re-establishing session

Open mseflek opened this issue 4 years ago • 3 comments

Hi,

I have the following function to download pipe-delimited .txt case filing reports from several bankruptcy courts (I believe case filed report scraping has not been implemented yet, according to #185).

from bs4 import BeautifulSoup


def query_case_report(court_id, pacer_session, range_begin, range_end):
    """Request a pipe-delimited case filing report for one bankruptcy court."""
    url = "https://ecf.{}.uscourts.gov/cgi-bin/CaseFiled-Rpt.pl?1-L_1_0-1".format(court_id)
    data = {
        'trustee': 'All',
        'DateType': 'filed',
        'StartDate': range_begin,
        'EndDate': range_end,
        'include_dismissed_cases': '1',
        'case_type': 'bk',
        'open_cases': 'on',
        'closed_cases': 'on',
        'party_information': 'on',
        'sort1': 'filed date',
        'data_format': 'data only',
        'display_dataonly_header': 'on'
    }

    # First POST: returns an HTML page whose form action names the report URL.
    intermediate_response = pacer_session.post(url, data)
    intermediate_doc = BeautifulSoup(intermediate_response.content, 'lxml')
    form = intermediate_doc.find('form')
    action = form.attrs.get('action')
    action_path = action.split('/')[-1]
    text_url = 'https://ecf.{}.uscourts.gov/cgi-bin/'.format(court_id) + action_path
    # Second POST: returns the text report (this is the request that re-logs in).
    response = pacer_session.post(text_url, data)
    return response


# ps is a PacerSession created and logged in elsewhere (not shown).
query_case_report('caeb', ps, '04/01/2020', '04/15/2020')

The issue is that the second POST request triggers an unnecessary re-login. (Side note: I tried the suggestions in #185 but could not get the pipe-delimited text file back with a single request. Any suggestions are welcome.) I've narrowed it down to the check_if_logged_in_page(text) function in juriscraper/pacer/http.py, which parses the returned HTML for the relevant text on the page. When a report comes back in the pipe-delimited, data-only format, that text is not there to find, so the function defaults to the False case and the session is treated as logged out.

I believe the following should fix it. In pacer.utils add the following function:

def is_text(response):
    """Determines whether the item downloaded is a text file or something else."""
    # Default to "" so a response with no content-type header does not raise.
    if '.txt' in response.headers.get("content-type", ""):
        return True
    return False
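
As a quick sanity check of the header test, here is a minimal sketch that runs without a PACER session. `FakeResponse` and the sample header values are assumptions made up for illustration, not real PACER output. This variant of `is_text` passes a default to `headers.get`, so a response with no content-type header returns False instead of raising a TypeError.

```python
class FakeResponse:
    """Hypothetical stand-in for a requests.Response; only .headers is used."""

    def __init__(self, headers):
        # A plain dict is enough here; real response headers are case-insensitive.
        self.headers = headers


def is_text(response):
    """Return True when the content-type header mentions a .txt file."""
    # Default to "" so a missing content-type header does not raise TypeError.
    return ".txt" in response.headers.get("content-type", "")


# Sample header values below are made up for illustration.
report = FakeResponse({"content-type": 'text/plain; name="report.txt"'})
html_page = FakeResponse({"content-type": "text/html"})
no_header = FakeResponse({})

for name, resp in [("report", report), ("html_page", html_page), ("no_header", no_header)]:
    print(name, is_text(resp))
```

Only the first response is treated as a text file; the HTML page and the header-less response are left to whatever check comes next.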

And in pacer.http, add the following lines below the is_pdf clause in the _login_again function:

if is_text(r):
    return False
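
To illustrate where this guard sits, here is a rough sketch of the decision order. It is not Juriscraper's actual `_login_again`: `needs_login_again`, `looks_logged_in`, the "Logout" marker, `FakeResponse`, and the sample report text are all hypothetical, and the helper bodies are simplified stand-ins. The only point is that PDF and text downloads are ruled out before any HTML-based check runs.

```python
class FakeResponse:
    """Hypothetical stand-in for a requests.Response."""

    def __init__(self, content_type, text=""):
        self.headers = {"content-type": content_type}
        self.text = text


def is_pdf(response):
    # Simplified stand-in for the existing PDF check.
    return response.headers.get("content-type") == "application/pdf"


def is_text(response):
    return ".txt" in response.headers.get("content-type", "")


def looks_logged_in(text):
    # Hypothetical marker check; real login detection is more involved.
    return "Logout" in text


def needs_login_again(response):
    """Decide whether to log in again, skipping non-HTML downloads."""
    if is_pdf(response):
        return False
    if is_text(response):
        return False
    # Only HTML pages reach the marker-based check.
    return not looks_logged_in(response.text)


# Made-up sample data: a pipe-delimited report and two HTML pages.
report = FakeResponse('text/plain; name="report.txt"', "case|filed\n20-00001|04/01/2020")
logged_in_page = FakeResponse("text/html", "<a href='/logout'>Logout</a>")
login_page = FakeResponse("text/html", "<form>Username</form>")

print(needs_login_again(report))          # text download: no re-login
print(not looks_logged_in(report.text))   # what the HTML check alone would decide
print(needs_login_again(logged_in_page))
print(needs_login_again(login_page))
```

Without the `is_text` guard, the report body contains no login marker, so the HTML check alone would ask for a new login on every text download.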

I tested these changes in a big scraping loop I'm running, and it seems to be working fine and has stopped the re-logins, but I'm definitely no expert on this so there might be a corner case I've missed/haven't thought about.

Thanks for building this library!

mseflek, Apr 27 '20 19:04

That's a solid improvement, yep. IIRC, there's an is_pdf check in there too, right? Seems like a good pattern to follow, and yes, I'm not at all surprised by this issue.

I can't think of any other areas of the code that'd be affected by a change like this.

Want to do a PR?

mlissner, Apr 27 '20 20:04

Yeah, there is an is_pdf check as well.

Sure, I can do a PR. It'll be my first for a FOSS project. Exciting!

mseflek, Apr 27 '20 21:04

Awesome! Lots of first-timers to Juriscraper. Looking forward to it.

mlissner, Apr 27 '20 21:04