Remove uses of Selenium and introduce OpinionSiteAspx
This PR builds off the work of @flooie in https://github.com/freelawproject/juriscraper/tree/remove-selenium
That branch removed all uses of Selenium in scrapers and replaced them with simple* requests-based scrapers.
My further work in this branch introduces the OpinionSiteAspx class, and refactors all of the newly updated scrapers to extend this class and use its methods. Attempts have been made to document those methods.
NOTE: This is a work in progress and is not ready to be merged (besides the fact that it needs master merged into it). Not all of the scrapers are done.
is the soup moniker derived from the beautifulsoup package name? It seems like you use self.soup where we traditionally might use self.html. It doesn't bother me; I just want to make sure I understand what it is. Is it always interchangeable with self.html? That is, is it always an HTML DOM object that can have XPath executed against it?
Thanks for the awesome work!
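To make the distinction in that question concrete, here is a small illustration (not code from this PR; the sample HTML and variable names are invented): juriscraper's traditional self.html is an lxml tree, which supports .xpath() directly, whereas a BeautifulSoup "soup" object has no .xpath() method and uses find()/CSS selectors instead, so the two are not drop-in interchangeable.

```python
# Illustrative only: lxml trees (the usual self.html) support XPath;
# BeautifulSoup objects do not expose an .xpath() method.
from lxml import html

page = "<html><body><p class='case'>Foo v. Bar</p></body></html>"
tree = html.fromstring(page)

# XPath runs directly against the lxml tree
names = tree.xpath("//p[@class='case']/text()")
print(names)  # ['Foo v. Bar']
```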
is the soup moniker derived from the beautifulsoup package name?
Yes, I believe it is.
I'd be happy to set and use self.html if that's more standard. Do you know the contract on that instance variable? For example, if I set it in _get_soup (now maybe _update_html), would it be safe from being overridden or reset?
The soup vs. html attribute caught my eye too. self.html usually gets set when the self.parse method in AbstractSite calls AbstractSite._download. Maybe you should override self._download instead of calling this self._get_soup?
Good idea, @mlissner. The initial setting of self.html/soup usually happens in _download, but we also tend to call _download only once. So if _get_soup is (or may be) called multiple times, using/overriding _download could be confusing. I don't really care; what do you think, @mlissner? I think I've seen a scraper call _download multiple times in iteration, but it's certainly not common.
is this still active?
It got hung up in discussions of "spawning multiple downloaders". I think it needs more discussion before it can be finished.
Also, it has probably started to rot a bit; there may have been more recent changes to the scrapers involved, as well as to the project as a whole, which would need to be painfully reconciled before this could be merged to master.
Thanks @audiodude. @mlissner, care to chime in?
Sorry, what was the "spawning multiple downloaders" issue, @audiodude?
(I think it shouldn't be too rotten. I bet it'd probably rebase cleanly.)
Basically the problem is that we crawl multiple pages in order to deliver the final lists of case_names, case_dates, download_urls etc.
I think there might have been a misunderstanding here: the list of URLs contains each binary PDF exactly once, and the scraper never requests those PDFs at all, let alone multiple times.
What does get requested multiple times are the index pages, or the inner iterating pages that list all of the opinions. In all cases I can think of, the scrapers are already set up with self.backward_days = N to avoid scraping the entirety of the site, as well.
Can we not just let _download retain its function of making the initial request to get the landing page, and then create some new function (_download_sub_page or something) that handles iterative requests from there on, which could be overridden in child scrapers if need be?
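A rough sketch of what that split might look like; only _download is an existing juriscraper method, and _download_sub_page, _fetch_and_parse, and the stubbed fetching below are hypothetical names used for illustration:

```python
# Sketch of the proposed split, with hypothetical helper names.
class OpinionSiteAspx:
    def __init__(self, url):
        self.url = url
        self.html = None

    def _download(self):
        # Keeps its traditional role: one request for the landing page,
        # which is what sets self.html.
        self.html = self._fetch_and_parse(self.url)
        return self.html

    def _download_sub_page(self, sub_url):
        # Iterative requests (search results, ASPX paging) go here; a
        # child scraper overrides just this method when a site needs
        # special handling (extra POST fields, viewstate, etc.).
        return self._fetch_and_parse(sub_url)

    def _fetch_and_parse(self, url):
        # Stub standing in for requests.get(url) plus lxml parsing.
        return f"<tree for {url}>"


site = OpinionSiteAspx("https://example.com/opinions")
landing = site._download()
sub = site._download_sub_page("https://example.com/opinions?page=2")
```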
I'm fine with a simpler design. I just want to make sure that, from the caller's perspective, it can easily check whether a page has changed without having to download any subpages. There are really three steps to some of these, right:
- Download the landing page
- Follow links on it to sub pages
- Download binaries from there
I want to be able to do only number 1 without doing 2 or 3 unless number 1 has changed. Right now that works via the DeferredList thing/monstrosity, and I'm happy to do it another (smarter) way that's less confusing. The DeferredList was my solution to this many years ago, and it has always felt kind of frightening.
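The workflow described in those three steps could be sketched roughly like this (everything here is illustrative, with stubbed fetchers; it is not how DeferredList is actually implemented):

```python
# Illustrative sketch: fetch the landing page (step 1), and only follow
# links to sub pages (steps 2-3) when its content hash has changed since
# the last crawl.
import hashlib

def crawl(fetch_landing, fetch_sub_pages, last_hash):
    landing = fetch_landing()                        # step 1
    h = hashlib.sha256(landing.encode()).hexdigest()
    if h == last_hash:
        return h, []                                 # unchanged: skip 2 and 3
    return h, fetch_sub_pages(landing)               # changed: follow links

# Toy usage with stubbed fetchers
page = "<html>opinion index</html>"
h1, urls = crawl(lambda: page, lambda p: ["a.pdf", "b.pdf"], last_hash=None)
h2, urls2 = crawl(lambda: page, lambda p: ["a.pdf", "b.pdf"], last_hash=h1)
print(urls, urls2)  # ['a.pdf', 'b.pdf'] []
```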
I think that many times with these sites, the landing page is just a search box, and the scraper generates a search query in order to get to the sub pages. So you don't know whether the sub pages are the same until you've fetched them.
I'm also confused about (3). I don't see anywhere that scrapers download PDFs, they simply return links to them in _get_download_urls. And presumably that's when CL compares the URLs to what it already has and decides not to download, in DeferredList or whatever.
So what I'm saying is that I think the current implementations of OpinionSiteAspx are fine.
Also, as another point of conversation: the original kan site got refactored into kan_p and kan_u while this PR was in progress. It's the last instance of OpinionSiteLinearWebDriven, but I propose leaving it for now and cleaning it up in a follow-up PR.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.
:white_check_mark: flooie
:x: audiodude
You have signed the CLA already but the status is still pending? Let us recheck it.
This is kinda nice, maybe we should revive this PR or integrate it.
maybe we should revive this PR
Definitely not a priority for FLP staff to do this work, but less Selenium is usually more happiness.
I can take a look at rebasing this to HEAD. Have any new Selenium/ASPX scrapers been added since Dec 2021?
Looks like all the sites I updated to remove Selenium are using OpinionSiteLinear now, and not using Selenium.
Huh, so...shall we close this and open a new issue to remove Selenium entirely?
@mlissner - unfortunately, we can't currently remove selenium entirely. We are stuck with it for Alabama and Colorado due to the PDF issue. But I agree, I like the ASPX code.