Support getting multiple fields from a secondary page
The current way to get data from a secondary page is to use a DeferringList, which is designed for parsing a single field. However, we may want to get multiple fields from the same page, as we saw most recently in a nev issue. With the current implementation, that would mean requesting the same page once per field, which is not desirable.
I list three possible solutions to this problem. The first is the easiest. The second is hacky. I would prefer the third, but it takes work beyond DeferringList proper.
Store DeferringList's HTML
A possible solution would be to store each URL's HTML in the site object. This does not require any changes to the current code, just a case-by-case implementation. Then, define one DeferringList for each attribute we want. It may cause memory issues in some backscrapers if the backlog is big.
from typing import List

from juriscraper.DeferringList import DeferringList
from juriscraper.OpinionSiteLinear import OpinionSiteLinear


class Site(OpinionSiteLinear):
    # Cache of secondary-page responses, keyed by URL
    url_to_deferred_html = {}
    ...

    def _get_case_names(self) -> List[str]:
        def fetcher(case):
            if case["name"] != "":
                # Return the name we extracted without using fetcher
                return case["name"]
            elif self.test_mode_enabled():
                # If we're in test mode, return a dummy name
                return "Test Name"
            else:
                # Else, query the API (once per URL) and return the name of the case
                self.url = f"https://www.courts.michigan.gov/api/CaseSearch/SearchCaseSearchContent/?searchQuery={case['title']}"
                if self.url not in self.url_to_deferred_html:
                    self.url_to_deferred_html[self.url] = self._download()
                self.html = self.url_to_deferred_html[self.url]
                case["name"] = self.html["opinionResults"]["searchItems"][0][
                    "title"
                ].title()
                return case["name"]

        return DeferringList(seed=self.cases, fetcher=fetcher)
Modify case inside a single DeferringList
A solution that requires no storing would be to modify the case object already passed to the fetcher function. This would be awkward at the AbstractSite._clean_attributes step, since the DeferringList is not supposed to have been executed at that point. However, there is currently a bug in the behavior of DeferringList, so this would work.
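A minimal sketch of this option, under stated assumptions: LazyList is a simplified stand-in for DeferringList, and fetch_page, REQUEST_COUNT and the field names are hypothetical, not juriscraper's API. The fetcher makes one request and writes every field it finds into the case dict it was given.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a network request; it counts calls so we can
# see how many times the secondary page is "downloaded".
REQUEST_COUNT = {"n": 0}


def fetch_page(title: str) -> Dict[str, str]:
    REQUEST_COUNT["n"] += 1
    return {"title": f"{title} v. state", "docket": f"D-{len(title)}"}


class LazyList:
    """Simplified lazy list: calls fetcher(seed[i]) on first access only."""

    def __init__(self, seed: List[dict], fetcher: Callable[[dict], str]):
        self._seed = seed
        self._fetcher = fetcher
        self._cache: Dict[int, str] = {}

    def __len__(self) -> int:
        return len(self._seed)

    def __getitem__(self, i: int) -> str:
        if i not in self._cache:
            self._cache[i] = self._fetcher(self._seed[i])
        return self._cache[i]


def fetcher(case: dict) -> str:
    # One request fills several fields of the same case dict.
    if not case.get("name"):
        page = fetch_page(case["title"])
        case["name"] = page["title"].title()
        case["docket"] = page["docket"]
    return case["name"]


cases = [{"title": "smith", "name": "", "docket": ""}]
names = LazyList(seed=cases, fetcher=fetcher)
first_name = names[0]  # triggers the single request
first_docket = cases[0]["docket"]  # filled as a side effect of that request
```

Note that the docket is only available after the name has been accessed, which is the ordering problem that makes this option hacky.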
Abstract away the AbstractSite lists pattern
In the end, CourtListener iterates over objects or Python dictionaries. We could abstract away the interface of AbstractSite and build an AbstractSite that manages a list of objects, not lists of attributes, while still supporting the list paradigm for current and old scrapers. We could create a new DeferringClass that updates many fields at the same time and interacts with the list-of-objects architecture. That way, updating multiple fields of a single object would be trivial.
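To illustrate the list-of-objects idea, here is a rough sketch. ObjectListSite, attribute_list and the sample field values are hypothetical names made up for this example, not existing juriscraper classes; the point is only that one dict per case makes multi-field updates a plain dict update, while a per-attribute view can still be derived for older code.

```python
from typing import Dict, Iterator, List


class ObjectListSite:
    """Hypothetical base class that stores one dict per case."""

    def __init__(self) -> None:
        self.cases: List[Dict[str, str]] = []

    def __iter__(self) -> Iterator[Dict[str, str]]:
        return iter(self.cases)

    def __len__(self) -> int:
        return len(self.cases)

    def __getitem__(self, i: int) -> Dict[str, str]:
        return self.cases[i]

    def attribute_list(self, key: str) -> List[str]:
        # Compatibility view for code that still expects per-attribute lists;
        # missing values default to an empty string.
        return [case.get(key, "") for case in self.cases]


site = ObjectListSite()
site.cases = [
    {"name": "Smith v. State", "docket": "A-1"},
    {"name": "Doe v. Roe", "docket": "B-2"},
]

# Updating several fields of one case is a single dict update.
site[0].update({"docket": "A-100", "judge": "Hypothetical Judge"})
dockets = site.attribute_list("docket")
```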
Option three makes sense to me. In other words, we have Site objects, which yield Case objects or something like that?
I took some time writing this, without complete testing, but it runs and works as a concept. The code can be seen here: https://github.com/freelawproject/juriscraper/compare/main...grossir:juriscraper:new_opinion_site_subclass?expand=1
Basically, it is a new class for OpinionSite. It inherits from AbstractSite but overrides __iter__, __getitem__, __len__ and parse, taking care to keep the same interface: for example, the ordering function is important because it affects the value of the hash.
I also tested a new way to get multiple deferred values in a single call, but couldn't manage to keep the convenience of __getitem__ from DeferringList without it being buggy. I chose an explicit function that must be used to consume the deferred values, if they exist. There is an example of this working with nev, and I think it looks good and has less boilerplate.
Finally, in order to replace AbstractSite._check_sanity, I took the opportunity to try out the JSON Schema Validator, about which I will write a longer comment on #838. To run it, install the dependency with pip install jsonschema==4.21.1. Its benefits will not be very visible on this branch, since it raises an Exception for the "deferred" fields: they should be strings, but they are functions. This can be solved by writing a custom validator, but that requires more work.
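The explicit-consumption approach could look roughly like the sketch below. This is not the code on the branch: make_deferred, consume_deferred, the "deferred" key and the example URLs are assumptions chosen for illustration. Each case may carry a callable that returns several fields at once, and the caller must invoke one function to resolve it.

```python
from typing import Callable, Dict, List

Case = Dict[str, object]


def make_deferred(title: str) -> Callable[[], Dict[str, str]]:
    # Hypothetical: in a real scraper this closure would request the
    # secondary page; here it returns fixed values derived from the title.
    def get_deferred_values() -> Dict[str, str]:
        return {"name": title.title(), "docket": f"No. {len(title)}"}

    return get_deferred_values


def consume_deferred(case: Case) -> Case:
    """Run the case's deferred function once, if any, and merge its fields."""
    deferred = case.pop("deferred", None)
    if callable(deferred):
        case.update(deferred())
    return case


cases: List[Case] = [
    {"url": "https://example.com/1", "deferred": make_deferred("doe v. roe")},
    {"url": "https://example.com/2", "name": "Already Known"},
]
resolved = [consume_deferred(case) for case in cases]
```

Because the deferred callable is popped before it runs, calling consume_deferred a second time on the same case is a no-op and does not repeat the request.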
This looks like a generally good direction to me. One thing I wonder is whether we should go all-in on this and leave all our old parsers behind. It's kind of a bummer that this would mean we have three generations of objects in the codebase:
- This (gen 3)
- AbstractSite (gen 1)
- AbstractSiteLinear (gen 2)
Hm...
I think OpinionSiteLinear sites can be updated to this new base class easily, since it follows some of the same usage conventions (using self.cases to store records; conversion of short to long names in the base class), as shown in the nev example.
1st gen would be harder to change and would require more testing.
So, we could still keep 2 classes and update the OpinionSite / 1st gen sites on a case-by-case basis, as they get outdated.
Sounds good. Down with 1st Gen OpinionSite!
The rise of the clustersite