feat(mass, massappct): backscraper for `masscases.com`
Helps solve #984
I put the backscraper in the same file as the scraper, even though it targets a different source. Having finished it, I now think the backscraper should live in its own file, because it targets a different site and uses its own extract_from_text. What do you think? Maybe juriscraper/opinions/united_states/state/mass_backscraper.py?
I normally wouldn't use juriscraper/opinions/united_states_backscrapers/mass.py, since it is awkward to have two folders for what is basically the same thing (scraping opinions, past or present), but in this case it could also be a proper place for the scripts...
One of the reasons I liked the united_states_backscrapers directory is that you can put one-off scripts in there and not worry too much if they go stale or stop working.
I tested this today and found that it crashes:
Traceback (most recent call last):
  File "/Users/Palin/Code/juriscraper/sample_caller.py", line 246, in main
    for site in site_yielder(
  File "/Users/Palin/Code/juriscraper/juriscraper/lib/importer.py", line 79, in site_yielder
    site._download_backwards(i)
  File "/Users/Palin/Code/juriscraper/juriscraper/opinions/united_states_backscrapers/state/mass.py", line 103, in _download_backwards
    self._process_html()
  File "/Users/Palin/Code/juriscraper/juriscraper/opinions/united_states_backscrapers/state/mass.py", line 55, in _process_html
    _, date_filed_str, name = row.xpath("td/text()")
ValueError: too many values to unpack (expected 3)
This happens on the live site after a few iterations, when it gets to this page:
2024-07-02 12:24:55,029 - INFO: Now downloading case page at: http://masscases.com/425-449.html
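For illustration, here is a minimal sketch of a more defensive unpacking, so that one unexpected row does not abort the whole backscrape. It is not the fix applied in this PR. The helper name `unpack_row` is hypothetical, and the sketch assumes that rows without exactly three non-blank text nodes can be logged and skipped; whether skipping is acceptable depends on what the extra cells on that page actually contain.

```python
import logging
from typing import Optional, Sequence, Tuple

logger = logging.getLogger(__name__)


def unpack_row(texts: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return (date_filed_str, name) for a table row, or None if malformed.

    `texts` stands in for the result of row.xpath("td/text()").
    Assumption: whitespace-only text nodes carry no data and can be dropped.
    """
    cells = [t.strip() for t in texts if t.strip()]
    if len(cells) != 3:
        # Log and skip instead of raising ValueError on tuple unpacking.
        logger.warning("Skipping row with %d text nodes: %r", len(cells), cells)
        return None
    _, date_filed_str, name = cells
    return date_filed_str, name


# Made-up sample values, only to show the calling convention.
good = unpack_row(["cite", "January 1, 1990", "A v. B"])
bad = unpack_row(["cite", "January 1, 1990", "A v. B", "unexpected extra"])
```

Inside `_process_html`, the caller would `continue` when the helper returns `None`, so the offending row shows up in the log with its contents.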
@grossir sorry about the conflicts here, but can you resolve them and re-PR this?
Just updated this to account for some edge cases. Please check again, @flooie.
Well, after debugging the integration with CourtListener, I just found a bug that causes extract_from_text from any scraper in the united_states_backscrapers folder to never be used: https://github.com/freelawproject/courtlistener/issues/4193
That will have to be solved before merging this, since we rely on extract_from_text to get docket numbers.
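To make the dependency concrete, here is a rough sketch of the kind of work extract_from_text does. It is written as a standalone function, not the method in this PR. The docket-number pattern (`SJC-12345` style) is a hypothetical placeholder, and the nested-dict return shape is an assumption about what the consumer expects. The real regex depends on how masscases.com prints docket numbers.

```python
import re
from typing import Dict

# Hypothetical pattern: assumes docket numbers look like "SJC-12345".
DOCKET_RE = re.compile(r"\b(SJC-\d+)\b")


def extract_from_text(scraped_text: str) -> Dict[str, Dict[str, str]]:
    """Pull a docket number out of the opinion text, if one is present.

    Assumption: the caller merges a {"Docket": {...}} mapping into the
    scraped metadata. Returns an empty dict when nothing matches.
    """
    match = DOCKET_RE.search(scraped_text)
    if not match:
        return {}
    return {"Docket": {"docket_number": match.group(1)}}
```

If this function is never called, as described in the linked issue, the docket number is simply missing from the ingested record, which is why the issue blocks this PR.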