
feat(mass, massappct): backscraper for `masscases.com`

Open grossir opened this issue 1 year ago • 3 comments

Helps solve #984

grossir avatar Apr 16 '24 01:04 grossir

I have put the backscraper in the same file as the scraper, even though it targets a different source. After finishing it, I actually think the backscraper should be in its own file, because it targets a different site and uses its own extract_from_text. What do you think? Maybe juriscraper/opinions/united_states/state/mass_backscraper.py?

I normally wouldn't use juriscraper/opinions/united_states_backscrapers/mass.py since it is awkward to have two folders for what is basically the same thing (scraping opinions past or present), but in this case it seems it could also be a proper place for the scripts...

grossir avatar Apr 16 '24 14:04 grossir

One of the reasons I liked the united_states_backscrapers directory is that you can put one-off scripts in there and not worry too much if they go stale or stop working.

mlissner avatar Apr 16 '24 15:04 mlissner

I tested this today and found it crashed:

Traceback (most recent call last):
  File "/Users/Palin/Code/juriscraper/sample_caller.py", line 246, in main
    for site in site_yielder(
  File "/Users/Palin/Code/juriscraper/juriscraper/lib/importer.py", line 79, in site_yielder
    site._download_backwards(i)
  File "/Users/Palin/Code/juriscraper/juriscraper/opinions/united_states_backscrapers/state/mass.py", line 103, in _download_backwards
    self._process_html()
  File "/Users/Palin/Code/juriscraper/juriscraper/opinions/united_states_backscrapers/state/mass.py", line 55, in _process_html
    _, date_filed_str, name = row.xpath("td/text()")
ValueError: too many values to unpack (expected 3)

This happens on the live site after a few iterations, when it reaches this page: 2024-07-02 12:24:55,029 - INFO: Now downloading case page at: http://masscases.com/425-449.html
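
For illustration, here is a minimal sketch of one way to make that unpacking tolerant of extra text nodes. It is not the fix that was actually applied. The helper name `parse_row` is hypothetical, and the sketch assumes (without having checked the page markup) that the first text node is the citation, the second is the date, and any further nodes are fragments of the case name.

```python
from typing import List, Optional, Tuple


def parse_row(cells: List[str]) -> Optional[Tuple[str, str]]:
    """Return (date_filed_str, name) from a row's text nodes, or None.

    `cells` stands in for the result of row.xpath("td/text()").
    Assumption (not verified against masscases.com): node 0 is the
    citation, node 1 is the date, and the rest belong to the case name.
    """
    # Drop whitespace-only nodes and trim the rest.
    cells = [c.strip() for c in cells if c.strip()]
    if len(cells) < 3:
        # Malformed row: let the caller log it and move on.
        return None
    # Star-unpacking absorbs any number of trailing name fragments,
    # so a row with 4+ text nodes no longer raises ValueError.
    _, date_filed_str, *name_parts = cells
    return date_filed_str, " ".join(name_parts)


# Made-up values, only to show the call shape.
print(parse_row(["citation", "date", "NAME PART ONE", "NAME PART TWO"]))
```

Returning None for short rows keeps the decision about skipping or logging in the caller rather than hiding it in the parser.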

flooie avatar Jul 02 '24 16:07 flooie

@grossir sorry about the conflicts here, but can you resolve them and re-PR this?

flooie avatar Jul 08 '24 20:07 flooie

Just updated this to account for some edge cases. Please check again, @flooie.

grossir avatar Jul 09 '24 16:07 grossir

Well, after debugging the integration with CourtListener, I just found a bug that causes extract_from_text from any scraper in the united_states_backscrapers folder to never be used: https://github.com/freelawproject/courtlistener/issues/4193

I will have to solve that before merging this, since we rely on extract_from_text to get docket numbers.
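
As a rough sketch of what "getting docket numbers from extract_from_text" can look like, the standalone function below pulls a docket number out of opinion text with a regular expression. It is illustrative only and not the code in this PR: in a scraper it would be a method on the site class, the `SJC-` pattern is a hypothetical example that may not match what masscases.com pages contain, and the nested return shape is an assumption about the convention the consumer expects.

```python
import re
from typing import Dict

# Hypothetical pattern: assumes docket numbers look like "SJC-12345".
# The real pages may use a different format.
DOCKET_RE = re.compile(r"\b(SJC-\d+)\b")


def extract_from_text(scraped_text: str) -> Dict[str, Dict[str, str]]:
    """Return metadata found in the opinion text, or {} if none is found.

    The {"Docket": {"docket_number": ...}} shape is an assumption made
    for this sketch.
    """
    match = DOCKET_RE.search(scraped_text)
    if not match:
        return {}
    return {"Docket": {"docket_number": match.group(1)}}
```

Because the function returns an empty dict when nothing matches, a caller can merge the result into existing metadata without special-casing missing docket numbers.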

grossir avatar Jul 10 '24 21:07 grossir