juriscraper Docket descriptions aren't always in column 2 of a PACER docket report

For instance, if View Multiple Documents is checked, they move to the 3rd column. The resulting table is notionally:

<table><tbody>
  <tr>
    <td>Date Filed</td>
    <th>#</th>
    <td><small><a href="#" onclick="uncheckAll(document.view_multi_docs.arr_de_seq_nums)">clear</a></small></td>
    <td style="font-weight:bold">Docket Text</td>
  </tr><tr>
    <td>11/21/2018</td>
    <td><a>186</a>&nbsp;</td>
    <td><input type="checkbox" name="arr_de_seq_nums" value="567" onclick="this.form.total.value=calculateTotal(this.checked, 44712.0);"> </td>
    <td>Judge Mark L. Wolf: "It is hereby ORDERED that the parties
      shall continue to confer and report on the status of their
      discussions by December 3, 2018." ENDORSED ORDER entered. re <a>185</a>
      Status Report filed by Kirstjen M NIELSEN (Bono, Christine)
      (Entered: 11/21/2018)
    </td>
  </tr>
</tbody></table>

But the parser hardcodes column 2:

https://github.com/freelawproject/juriscraper/blob/f49243fa0f5e9b366159671037adf699f42b7823/juriscraper/pacer/docket_report.py#L1003-L1005

which is obviously wrong. I expect I'll fix this...in a while.

Dec 04 '18 04:12 johnhawkinson

This is normalized early on when parsing by deleting the extra column:

https://github.com/freelawproject/juriscraper/blob/104d32b439c40bd571d36ac75e1c7a651b522b8b/juriscraper/pacer/docket_report.py#L799-L802

That way we don't have to think about it ever again and can just treat all tables (mostly) the same.
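To make the idea concrete, here is a minimal sketch of that kind of normalization. It is not the juriscraper code linked above: it uses only the standard library's `html.parser`, and `RowCollector`, `normalized_rows`, and `SAMPLE` are hypothetical names made up for illustration. It assumes the extra column can be recognized by the `<input>` checkbox it contains.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect table rows as lists of (text, has_input) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, has_input] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = [[], False]
        elif tag == "input" and self._cell is not None:
            self._cell[1] = True  # this cell holds a form control

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # tolerate a cell with no enclosing <tr>
                self.rows.append([])
            # Collapse runs of whitespace (including &nbsp;) to single spaces.
            text = " ".join("".join(self._cell[0]).split())
            self.rows[-1].append((text, self._cell[1]))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)


def normalized_rows(html):
    """Return rows of cell text with checkbox cells dropped (illustrative)."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [[text for text, has_input in row if not has_input]
            for row in parser.rows]


# Abbreviated, illustrative version of the table from this issue.
SAMPLE = """
<table>
  <tr><td>Date Filed</td><th>#</th><td><a href="#">clear</a></td><td>Docket Text</td></tr>
  <tr><td>11/21/2018</td><td><a>186</a>&nbsp;</td>
      <td><input type="checkbox" name="arr_de_seq_nums" value="567"></td>
      <td>ENDORSED ORDER entered.</td></tr>
</table>
"""
```

With this sketch, the data row comes back with three cells, but the header row keeps four, because its extra cell holds a "clear" link rather than an `<input>`. Any approach along these lines would need to treat the header row separately.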

Dec 04 '18 17:12 mlissner

Sorry, no it is not:

Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
(juriscraper) pb3:juriscraper jhawk$ cd -
/Users/jhawk/src/juriscraper/juriscraper/pacer/appellate
(juriscraper) pb3:appellate jhawk$ PYTHONPATH=~/src/juriscraper python -m juriscraper.pacer.docket_report 12.03.02\ CM_ECF\ -\ USDC\ Massachusetts\ -\ Version\ 6.1.2\ as\ of\ 6_9_2018.html 

Warning: No such file or directory: /var/log/juriscraper/debug.log. Have you created the directory for the log?
Juriscraper will continue to run, and all logs will be sent to stderr.
2018-12-04 13:00:56,763 - WARNING: You are using a narrow build of Python, which is not completely supported. See issue #188 for details.
Parsing HTML file at 12.03.02 CM_ECF - USDC Massachusetts - Version 6.1.2 as of 6_9_2018.html
/Users/jhawk/src/juriscraper/lib/python2.7/site-packages/html5lib/_ihatexml.py:257: DataLossWarning: Coercing non-XML name
  warnings.warn("Coercing non-XML name", DataLossWarning)
Traceback (most recent call last):
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 1133, in <module>
    pprint.pprint(report.data, indent=2)
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 50, in data
    data[u'docket_entries'] = self.docket_entries
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 808, in docket_entries
    de[u'date_filed'] = convert_date_string(date_filed_str)
  File "/Users/jhawk/src/juriscraper/juriscraper/lib/string_utils.py", line 486, in convert_date_string
    dt = parser.parse(date_string, fuzzy=fuzzy)
  File "/Users/jhawk/src/juriscraper/lib/python2.7/site-packages/dateutil/parser.py", line 1161, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/Users/jhawk/src/juriscraper/lib/python2.7/site-packages/dateutil/parser.py", line 555, in parse
    raise ValueError("String does not contain a date.")
ValueError: String does not contain a date.

Also, Create Appendix creates two additional columns.

Dec 04 '18 18:12 johnhawkinson

Hm. Why would the date column be screwy? It's always in the first position? Does your example have something weird in there?

I don't think I've ever seen the "Create Appendix" feature. Is this a filing feature or just a mass feature?

Dec 04 '18 18:12 mlissner

Looks like it's just a mass feature. Huh. Well, we should respond to that, but I think we're not getting it uploaded anyway because of the bug that prevents dockets from being uploaded when we see the intermediate "this docket is too big" page. That'd explain why I don't see errors about this page at least.

Dec 04 '18 18:12 mlissner

Yeah, I think Create Appendix is ecf.mad-specific, but View Multiple Documents is not. An elegant fix should cover them both.
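One way to cover both cases might be to find the columns by their header text instead of by position. The following is only a sketch of that idea, not existing juriscraper code: `locate_columns` and `parse_entry` are hypothetical helpers, the rows are assumed to be already reduced to lists of cell text, and it assumes the header row has the same number of cells as the data rows (true for the View Multiple Documents table above, unverified for Create Appendix).

```python
def locate_columns(header_cells, wanted=("Date Filed", "#", "Docket Text")):
    """Map each wanted header label to its position in the header row."""
    positions = {}
    for i, text in enumerate(header_cells):
        label = " ".join(text.split())  # collapse stray whitespace
        if label in wanted and label not in positions:
            positions[label] = i
    missing = [label for label in wanted if label not in positions]
    if missing:
        raise ValueError("header row lacks columns: %s" % ", ".join(missing))
    return positions


def parse_entry(cells, positions):
    """Pull the fields out of one data row using the located positions.

    The dictionary keys here are illustrative.
    """
    return {
        "date_filed": cells[positions["Date Filed"]].strip(),
        "document_number": cells[positions["#"]].strip(),
        "description": cells[positions["Docket Text"]].strip(),
    }


# Header rows for the two layouts described in this issue.
DEFAULT_HEADER = ["Date Filed", "#", "Docket Text"]
MULTI_DOC_HEADER = ["Date Filed", "#", "clear", "Docket Text"]
```

For example, `parse_entry(["11/21/2018", "186", "", "ENDORSED ORDER entered."], locate_columns(MULTI_DOC_HEADER))` picks the description from the last cell, and the same call works unchanged on a three-column row with `DEFAULT_HEADER`. Extra columns anywhere in the row would be skipped as long as the header labels stay the same.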

You can certainly get such pages uploaded when someone restricts the docket report. I uploaded one of each variety last night, pq 1616825 and 1616823 I think.

Also, I suspect few people use this feature.

I don't know why the date column would be screwy though. I haven't had a chance to look. This issue was really a placeholder for me to come back to it...

Dec 04 '18 18:12 johnhawkinson
