juriscraper
Docket descriptions aren't always in column 2 of a PACER docket report
For instance, if "View Multiple Documents" is checked, a checkbox column is inserted and the descriptions move to the 3rd column.

The resulting table is notionally:
<table><tbody>
<tr>
<td>Date Filed</td>
<th>#</th>
<td><small><a href="#" onclick="uncheckAll(document.view_multi_docs.arr_de_seq_nums)" '="">clear</a></small></td>
<td style="font-weight:bold">Docket Text</td>
</tr><tr>
<td>/21/2018</td>
<td><a>186</a> </td>
<td><input type="checkbox" name="arr_de_seq_nums" value="567" onclick="this.form.total.value=calculateTotal(this.checked, 44712.0);"> </td>
<td>Judge Mark L. Wolf: "It is hereby ORDERED that the parties
shall continue to confer and report on the status of their
discussions by December 3, 2018." ENDORSED ORDER entered. re <a>185</a>
Status Report filed by Kirstjen M NIELSEN (Bono, Christine)
(Entered: 11/21/2018)
</td>
</tr>
</tbody></table>
But the parser hardcodes column 2:
https://github.com/freelawproject/juriscraper/blob/f49243fa0f5e9b366159671037adf699f42b7823/juriscraper/pacer/docket_report.py#L1003-L1005
which is obviously wrong. I expect I'll fix this...in a while.
This is normalized early on when parsing by deleting the extra column:
https://github.com/freelawproject/juriscraper/blob/104d32b439c40bd571d36ac75e1c7a651b522b8b/juriscraper/pacer/docket_report.py#L799-L802
That way we don't have to think about it ever again and can just treat all tables (mostly) the same.
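The normalization idea can be sketched roughly as follows. This is a stdlib-only illustration and not the juriscraper code at the linked lines: rows are modeled as plain lists of cell HTML strings, the sample data is made up, and `drop_checkbox_column` is a hypothetical helper name.

```python
def drop_checkbox_column(rows):
    """Return rows with any column that holds a checkbox removed.

    `rows` is a list of rows, each a list of cell HTML strings
    (a simplification; the real parser works on a parsed HTML tree).
    """
    # Collect the indexes of columns where some cell contains a checkbox.
    checkbox_cols = set()
    for row in rows:
        for i, cell in enumerate(row):
            if 'type="checkbox"' in cell:
                checkbox_cols.add(i)
    # Rebuild each row, header included, without those columns.
    return [
        [cell for i, cell in enumerate(row) if i not in checkbox_cols]
        for row in rows
    ]


# Made-up sample shaped like the "View Multiple Documents" table.
rows = [
    ["Date Filed", "#", "<a>clear</a>", "Docket Text"],
    ["11/21/2018", "<a>186</a>",
     '<input type="checkbox" name="arr_de_seq_nums">',
     "ENDORSED ORDER entered."],
]
normalized = drop_checkbox_column(rows)
```

After this step the description is back at index 2, so code that assumes a fixed column keeps working, provided the normalization actually runs on the page in question.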
Sorry, no it is not:
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
(juriscraper) pb3:juriscraper jhawk$ cd -
/Users/jhawk/src/juriscraper/juriscraper/pacer/appellate
(juriscraper) pb3:appellate jhawk$ PYTHONPATH=~/src/juriscraper python -m juriscraper.pacer.docket_report 12.03.02\ CM_ECF\ -\ USDC\ Massachusetts\ -\ Version\ 6.1.2\ as\ of\ 6_9_2018.html
Warning: No such file or directory: /var/log/juriscraper/debug.log. Have you created the directory for the log?
Juriscraper will continue to run, and all logs will be sent to stderr.
2018-12-04 13:00:56,763 - WARNING: You are using a narrow build of Python, which is not completely supported. See issue #188 for details.
Parsing HTML file at 12.03.02 CM_ECF - USDC Massachusetts - Version 6.1.2 as of 6_9_2018.html
/Users/jhawk/src/juriscraper/lib/python2.7/site-packages/html5lib/_ihatexml.py:257: DataLossWarning: Coercing non-XML name
warnings.warn("Coercing non-XML name", DataLossWarning)
Traceback (most recent call last):
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 1133, in <module>
    pprint.pprint(report.data, indent=2)
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 50, in data
    data[u'docket_entries'] = self.docket_entries
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 808, in docket_entries
    de[u'date_filed'] = convert_date_string(date_filed_str)
  File "/Users/jhawk/src/juriscraper/juriscraper/lib/string_utils.py", line 486, in convert_date_string
    dt = parser.parse(date_string, fuzzy=fuzzy)
  File "/Users/jhawk/src/juriscraper/lib/python2.7/site-packages/dateutil/parser.py", line 1161, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/Users/jhawk/src/juriscraper/lib/python2.7/site-packages/dateutil/parser.py", line 555, in parse
    raise ValueError("String does not contain a date.")
ValueError: String does not contain a date.
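One way to find out which cell triggers this `ValueError` is to wrap the date conversion so the failure reports the offending string and row. A minimal sketch, assuming dates look like MM/DD/YYYY and using `datetime.strptime` as a stand-in for juriscraper's `convert_date_string`; `parse_date_filed` is a hypothetical name.

```python
from datetime import datetime


def parse_date_filed(cell_text, row_number):
    """Parse a "Date Filed" cell, naming the bad value on failure."""
    text = cell_text.strip()
    try:
        # Assumed format; the real helper is more lenient than this.
        return datetime.strptime(text, "%m/%d/%Y").date()
    except ValueError:
        raise ValueError(
            "Row %d: expected a date in the first column, got %r"
            % (row_number, text)
        )
```

Calling this per row during debugging would show whether the first cell really holds something other than a date on the failing page.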
Also, "Create Appendix" creates two additional columns.
Hm. Why would the date column be screwy? It's always in the first position? Does your example have something weird in there?
I don't think I've ever seen the "Create Appendix" feature. Is this a filing feature or just a mass feature?
Looks like it's just a mass feature. Huh. Well, we should respond to that, but I think we're not getting it uploaded anyway because of the bug that prevents dockets from being uploaded when we see the intermediate "this docket is too big" page. That'd explain why I don't see errors about this page at least.
Yeah, I think "Create Appendix" is ecf.mad-specific, but "View Multiple Documents" is not. An elegant fix should cover them both.
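A fix that covers both cases could pick columns by their header label instead of by position. The sketch below is only an illustration under stated assumptions: it uses Python 3's stdlib `html.parser` where juriscraper uses its own HTML tree, it assumes the header row labels the wanted columns "Date Filed", "#" and "Docket Text" on every variant of the page, and `TableCells` and `select_columns` are hypothetical names.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> still belongs to the cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def select_columns(html, wanted=("Date Filed", "#", "Docket Text")):
    """Return body rows reduced to the wanted columns, found by header."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    header = [h.strip() for h in parser.rows[0]]
    idx = [header.index(name) for name in wanted]
    # Collapse runs of whitespace so multi-line docket text reads cleanly.
    return [[" ".join(row[i].split()) for i in idx] for row in parser.rows[1:]]


# Made-up sample shaped like the "View Multiple Documents" table.
sample = """
<table><tbody>
<tr><td>Date Filed</td><th>#</th><td><a href="#">clear</a></td><td>Docket Text</td></tr>
<tr><td>11/21/2018</td><td><a>186</a> </td><td><input type="checkbox"> </td><td>ENDORSED ORDER entered.</td></tr>
</tbody></table>
"""
entries = select_columns(sample)
```

Because unknown columns are simply ignored, the same lookup should tolerate one extra column or two, as long as the header labels hold. `header.index` raises `ValueError` when a label is missing, which at least fails loudly instead of reading the wrong column.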
You can certainly get such pages uploaded when someone restricts the docket report. I uploaded one of each variety last night, pq 1616825 and 1616823 I think.
Also, I suspect few people use this feature.
I don't know why the date column would be screwy though. Haven't had a chance to look. This Issue was really a placeholder for me to come back to it...