
Mass. scraper skips unpublished decisions

Open johnhawkinson opened this issue 7 years ago • 7 comments

Sometimes they're important.

Today I wanted 16-P-1378: Richard D. Fanning vs. Board of Zoning Appeal of Cambridge. Doesn't show up.

The Court publishes a PDF called "List of Unpublished Appeals Court Decisions for February 15, 2018" which itself contains links to further PDFs. The scraper doesn't handle this.

In fact it specifically exempts it with an unreadably terse exclude_list = "not(contains(., 'List of Un'))" that isn't at all understandable unless you happen to know what is going on!

https://github.com/freelawproject/juriscraper/blob/00c3874519ecb1f95b1676cd41f388595b90d009/juriscraper/opinions/united_states/state/mass.py#L66-L76

johnhawkinson avatar Feb 16 '18 04:02 johnhawkinson

Correct. We do not convert/parse PDFs, and never have, as far as I know. @mlissner will know for sure.

And just so you are clear, we couldn't simply not-exclude those records to magically get what you want. What you are asking is a lot more complex, seeing as it requires converting and parsing random PDF files. I am not saying it is impossible, just that it is a task we don't currently support and handle anywhere else, as far as I know.

P.S. If you think code is "unreadably terse", feel free to submit a pull request with cleaner code. I happen to think a 100+ character single string looks far more confusing:

//a/text()[not(contains(., 'List of Un'))][contains(., '2018)')][contains(., 'SJC ') or contains(., 'SJC-')]
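For readers unfamiliar with XPath predicates, here is a pure-Python sketch of the filtering those three bracketed conditions perform (the real scraper evaluates them as XPath via lxml; the sample anchor texts below are invented for illustration):

```python
# Assumed behavior of the three XPath predicates, expressed in plain
# Python. The anchor texts are hypothetical examples, not real links.
def keep(anchor_text: str) -> bool:
    exclude_list = "List of Un" not in anchor_text
    include_date = "2018)" in anchor_text
    include_court = "SJC " in anchor_text or "SJC-" in anchor_text
    return exclude_list and include_date and include_court

links = [
    "Smith v. Jones, 478 Mass. 1 (Feb. 15, 2018) SJC-12345",
    "List of Unpublished Appeals Court Decisions for February 15, 2018",
]
kept = [t for t in links if keep(t)]
# Only the first (published opinion) anchor survives the filter.
```

This mirrors the point above: removing the exclusion would only admit a link to another PDF index, not the opinions themselves.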

arderyp avatar Feb 16 '18 04:02 arderyp

Yeah, I realize removing the exclusion wouldn't solve the problem.

The "unreadable terseness" derives not from assembling the XPath via Python % string formatting, but from the fact that the reader is expected to understand, without context or comment, that 'List of Un' means the code is excluding the list of unpublished decisions from the set of opinion PDFs. I don't think most readers would understand that "Un" is short for "Unpublished", or cotton to its significance. Hence...unreadable terseness.

That said, I do think the % assembly of the XPath is confusing (but not terse), because it requires following several layers of indirection, whiplashing between lines 72, 73, 66, 74, 67, 75, and 68-70 to read and understand the code. But I'm not proposing a change to that.

johnhawkinson avatar Feb 16 '18 04:02 johnhawkinson

fair enough regarding "Un".

As for the disassembled xpath, the variable names should make it pretty clear what's going on:

"//a/text()[%s][%s][%s]" % (
    exclude_list,
    include_date,
    include_court,
)
# Find any anchors whose text content includes the proper date,
# includes the proper court ID, and does not include the List text.
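For concreteness, here is an illustrative reconstruction of that assembly in Python; the component values are inferred from the full expression quoted earlier in the thread, not copied from the scraper itself:

```python
# Hypothetical reconstruction of the scraper's XPath assembly.
# Component values are inferred from the fully assembled expression
# quoted above, not taken from the actual source file.
exclude_list = "not(contains(., 'List of Un'))"
include_date = "contains(., '2018)')"
include_court = "contains(., 'SJC ') or contains(., 'SJC-')"

path = "//a/text()[%s][%s][%s]" % (exclude_list, include_date, include_court)
print(path)
# //a/text()[not(contains(., 'List of Un'))][contains(., '2018)')][contains(., 'SJC ') or contains(., 'SJC-')]
```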

But, I suppose complexity is in the eye of the beholder.

Submit a PR with comments and your proposed variable name and string concatenation changes.

I think @mlissner should field your PDF parsing request... unless you want to submit a PR for that too.

arderyp avatar Feb 16 '18 04:02 arderyp

Unless you want to submit a PR for that too.

The purpose of this issue is tracking that. I imagine I won't have the bandwidth for it for a while, hence opening the issue.

As for the disassembled xpath, the variable names should make it pretty clear what's going on ... But, I suppose complexity is in the eye of the beholder.

Complexity and Confusion and Clarity are not necessarily in tension. The observation is that it introduces levels of indirection, and that indirection requires more stack space to read. I don't claim that it isn't clear, or even that it is complex -- I allege it is confusing.

Submit a PR with comments and your proposed variable name and string concatenation changes.

"I'm not proposing a change to that."

johnhawkinson avatar Feb 16 '18 04:02 johnhawkinson

All I'm saying is that it is more productive and respectful to drop the attitude, submit a PR to change "Un" to "Unpublished", and make this issue just about the PDF parsing, instead of about airing your personal frustrations. The bandwidth required to submit that 9 character change is significantly less than that required to type all of the associated text you typed above. Everyone's time is tight, and many of us are volunteering the countless hours we put into this project. We are on the same team.

arderyp avatar Feb 16 '18 05:02 arderyp

All I'm saying is I have time to file an issue and not to file one or more PRs at this time. Noting the issues to be covered in such a PR in the Issue is appropriate, and I think it's important to air all the frustrations a user encounters in their bug report, and to encourage that airing. I covered in the lead issue all the things I anticipate addressing were I to file a PR to address this issue. If you want to split the Issue up into multiple Issues, that's fine by me, but I'd like to report in the way that works best for me.

So I think it's 10x as important to push back on resistance to issues than it is to actually fix bugs or report them. Fostering an environment that encourages reporting is more important than actually fixing bugs, in my view. So I prioritize the debate and discussion much higher than the implementation.

johnhawkinson avatar Feb 16 '18 05:02 johnhawkinson

Bah. I'm tempted to delete all y'all's comments here because reading them just took five minutes.

We can refactor when we get to it, and yes, @arderyp is right that parsing a PDF for additional links is something we've never done because it's horrible. You both are awesome, please carry on otherwise.

mlissner avatar Feb 16 '18 05:02 mlissner

@johnhawkinson same here. This is certainly resolved no?

flooie avatar Dec 31 '22 14:12 flooie

Okay, I just added a PR @johnhawkinson that should fix all of this.

I updated the PR to update Mass and Mass App Ct. The 1:28 archive, i.e. the massappct_u scraper, should be fixed as well. The website no longer requires parsing PDFs, either.

flooie avatar Jan 28 '23 15:01 flooie