Support getting multiple fields from a secondary page

Open grossir opened this issue 2 years ago • 6 comments

The current method to get data from a secondary page is to use a DeferringList, which is designed for parsing a single field. However, we may want to get multiple fields from the same page, as we saw most recently in a nev issue. With the current implementation, that would mean requesting the same page once per field, which is not desirable.

I list 3 possible solutions to this problem. The first is the easiest and the second is hacky. I would prefer the third, but it takes work beyond the DeferringList proper.

Store DeferringList's HTML

A possible solution would be to store the URL's HTML in the site object. This does not require any changes to the current code, just a case-by-case implementation. Then, define one DeferringList for each attribute we may want. It may cause memory issues in some backscrapers if the backlog is big, since every downloaded page stays in memory.

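from typing import List

from juriscraper.DeferringList import DeferringList
from juriscraper.OpinionSiteLinear import OpinionSiteLinear


class Site(OpinionSiteLinear):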
    url_to_deferred_html = {}

    ...

    def _get_case_names(self) -> List[str]:
        def fetcher(case):
            if case["name"] != "":
                # Return the name we extracted without using fetcher
                return case["name"]
            elif self.test_mode_enabled():
                # if we're in test mode, return a dummy name
                return "Test Name"
            else:
                # Else, query the API and return the name of the case
                self.url = f"https://www.courts.michigan.gov/api/CaseSearch/SearchCaseSearchContent/?searchQuery={case['title']}"
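                # Download only on a cache miss, so other DeferringLists can reuse the response
                if self.url not in self.url_to_deferred_html:
                    self.url_to_deferred_html[self.url] = self._download()
                self.html = self.url_to_deferred_html[self.url]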

                case["name"] = self.html["opinionResults"]["searchItems"][0][
                    "title"
                ].title()
            return case["name"]

        return DeferringList(seed=self.cases, fetcher=fetcher)

Modify case inside a single DeferringList

A solution that requires no storage would be to modify the case object that is already passed to the fetcher function. This would be awkward at the AbstractSite._clean_attributes step, since the DeferringList is not supposed to have been executed yet at that point. That said, there is currently a bug in the behavior of DeferringList, so this would work.
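
A rough illustration of this second option, written without juriscraper so it runs on its own: `fake_download`, the URL and the field names are all assumptions made for the sketch. The fetcher fills several fields of the case dict in one request and still returns only the single value a DeferringList expects.

```python
from typing import Dict, List

requests_made = 0  # counts simulated network requests


def fake_download(url: str) -> Dict[str, str]:
    """Hypothetical stand-in for a request to the secondary page."""
    global requests_made
    requests_made += 1
    return {"name": "Doe v. Roe", "docket": "A-123", "judge": "Smith"}


def fetcher(case: Dict[str, str]) -> str:
    """Fill every missing field of `case` in one request, then return the name."""
    if not case.get("name"):
        page = fake_download(case["url"])
        # Mutating the seed dict makes the extra fields available to later steps
        case.update(name=page["name"], docket=page["docket"], judge=page["judge"])
    return case["name"]


cases: List[Dict[str, str]] = [{"url": "https://example.com/case/1", "name": ""}]

name = fetcher(cases[0])        # first access triggers the request
name_again = fetcher(cases[0])  # second access reuses the mutated case
print(name, cases[0]["docket"], requests_made)
```

The extra fields only exist after the fetcher has run for that case, which is why the order of execution relative to the cleaning step matters.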

Abstract away the AbstractSite lists pattern

In the end, courtlistener iterates over objects or Python dictionaries. We could abstract away the interface of AbstractSite and build an AbstractSite that manages a list of objects, not lists of attributes, while still supporting the list paradigm for current and old scrapers. We could create a new DeferringClass that updates many fields at the same time and that interacts with the list-of-objects architecture. That way, updating multiple fields of a single object would be trivial.
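
To make the third option more concrete, here is a small sketch of a site that keeps one dict per case and derives attribute lists on demand. `CaseListSite` and `attribute_list` are hypothetical names invented for this example, not existing juriscraper classes or methods.

```python
from typing import Dict, Iterator, List


class CaseListSite:
    """Hypothetical site that stores one dict per case instead of parallel lists."""

    def __init__(self, cases: List[Dict[str, str]]):
        self.cases = cases

    def __iter__(self) -> Iterator[Dict[str, str]]:
        return iter(self.cases)

    def __len__(self) -> int:
        return len(self.cases)

    def __getitem__(self, i: int) -> Dict[str, str]:
        return self.cases[i]

    def attribute_list(self, field: str) -> List[str]:
        """Legacy-style view: one list per attribute, derived from the objects."""
        return [case[field] for case in self.cases]


site = CaseListSite(
    [
        {"name": "Doe v. Roe", "docket": "A-1"},
        {"name": "In re Smith", "docket": "A-2"},
    ]
)

# Updating several fields of one case touches a single object
site[0].update(name="Doe v. Roe (amended)", docket="A-1b")

for case in site:
    print(case["name"], case["docket"])
print(site.attribute_list("docket"))
```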

grossir avatar Jan 25 '24 17:01 grossir

Option three makes sense to me. In other words, we have Site objects, which yield Case objects or something like that?

mlissner avatar Jan 25 '24 17:01 mlissner

I took some time writing this, without complete testing, but it runs and works as a concept. The code can be seen here: https://github.com/freelawproject/juriscraper/compare/main...grossir:juriscraper:new_opinion_site_subclass?expand=1

Basically, it is a new class for OpinionSite. It inherits from AbstractSite but overrides __iter__, __getitem__, __len__ and parse, taking care to keep the same interface. For example, the ordering function is important because it affects the value of the hash.

I also tested a new way to get multiple deferred values in a single call, but couldn't manage to keep the convenience of __getitem__ from DeferringList without it being buggy. I chose an explicit function that must be used to consume the deferred values, if they exist. There is an example of this working with nev, and I think it looks good and has less boilerplate.
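
The explicit-consumption idea could look roughly like the sketch below. It is not the code from the linked branch: `resolve_deferred`, the `"deferred"` key and `scrape_detail_page` are assumed names for illustration, and the returned values are dummy data.

```python
from typing import Callable, Dict, Optional

calls = 0  # counts simulated requests to the secondary page


def scrape_detail_page() -> Dict[str, str]:
    """Hypothetical deferred scraper: one request yields several fields."""
    global calls
    calls += 1
    return {"citation": "dummy citation", "judge": "Smith"}


def resolve_deferred(case: dict) -> dict:
    """Explicitly consume a case's deferred function, if it has one.

    The "deferred" key is an assumed convention for this sketch.
    """
    deferred: Optional[Callable[[], Dict[str, str]]] = case.pop("deferred", None)
    if deferred is not None:
        case.update(deferred())  # many fields filled by a single call
    return case


case = {"name": "Doe v. Roe", "deferred": scrape_detail_page}

resolve_deferred(case)
resolve_deferred(case)  # no-op: the deferred function was already consumed
print(case, calls)
```

Making the call explicit trades the transparency of `__getitem__` for predictable timing: the request happens exactly when the caller asks for it.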

Finally, in order to replace AbstractSite._check_sanity, I took the opportunity to try out the JSON Schema Validator, about which I will write a longer comment on #838. To run it, install the dependency with pip install jsonschema==4.21.1. Its benefits will not be very visible on this branch, since it raises an Exception for the "deferred" fields: the schema expects strings, but they are functions. This can be solved by writing a custom validator, but that requires more work.

grossir avatar Jan 27 '24 01:01 grossir

This looks like a generally good direction to me. One thing I wonder is whether we should go all-in on this, and leave all our old parsers behind. It's kind of a bummer that this would mean we have three generations of object in the codebase:

  • This (gen 3)
  • AbstractSite (gen 1)
  • AbstractSiteLinear (gen 2)

Hm...

mlissner avatar Jan 27 '24 16:01 mlissner

I think OpinionSiteLinear sites can be updated to this new base class easily, since it follows some of the same usage conventions (using self.sites to store records; conversion of short to long names on the base class), as shown in the nev example.

1st Gen would be harder to change and require more testing.

So, we could still keep 2 classes and update the OpinionSite / 1st gen sites on a case-by-case basis, as they get outdated.

grossir avatar Jan 29 '24 16:01 grossir

Sounds good. Down with 1st Gen OpinionSite!

mlissner avatar Jan 29 '24 16:01 mlissner

The rise of the clustersite

flooie avatar Jan 30 '24 16:01 flooie