juriscraper icon indicating copy to clipboard operation
juriscraper copied to clipboard

Remove uses of Selenium and introduce OpinionSiteAspx

Open audiodude opened this issue 6 years ago • 14 comments

This PR builds off the work of @flooie in https://github.com/freelawproject/juriscraper/tree/remove-selenium

That branch removed all uses of Selenium in scrapers and replaced them with simple* requests based scrapers.

My further work in this branch introduces the OpinionSiteAspx class, and refactors all of the newly updated scrapers to extend this class and use its methods. Attempts have been made to document those methods.

NOTE: This is a work in progress and is not ready to be merged (besides the fact that it needs to have master merged onto it). Not all of the scrapers are done.

audiodude avatar Mar 07 '20 00:03 audiodude

is the soup moniker derived from the beautifulsoup package name? It seems like you use self.soup where we traditionally might use self.html. It doesn't bother me, I just want to make sure I understand what it is. Is it always interchangeable with self.html... that is, is it always an HTML DOM object that can have xpath executed against it?

Thanks for the awesome work!

arderyp avatar Mar 07 '20 00:03 arderyp

is the soup moniker derived from the beautifulsoup package name?

Yes, I believe it is.

I'd be happy to set and use self.html if that's more standard. Do you know the contract on that instance variable? Like if I set it in _get_soup (now _update_html maybe) would it be safe from being overridden or reset?

audiodude avatar Mar 07 '20 00:03 audiodude

The soup vs html attribute caught my eye too. self.html usually gets set by the self.parse method in AbstractSite calls AbstractSite._download. Maybe you should override self._download instead of calling this self._get_soup?

mlissner avatar Mar 07 '20 00:03 mlissner

good idea @mlissner. for the intial setting of self.html/soup, that usually happens in download. But we also tend to just call download once. So if _get_soup is or may be called multiple times maybe using/overriding _download would be confusing. I don't really care, what do you think @mlissner? I think I've seen a scraper call _download multiple times in iteration, but its certainly not common.

arderyp avatar Mar 07 '20 01:03 arderyp

is this still active?

arderyp avatar May 06 '20 05:05 arderyp

It got hung up in discussions of "spawning multiple downloaders". I think it needs more discussion before it can be finished.

Also it has probably started to rot a bit and there may have been more recent changes to the scrapers involved, as well as the project as a whole, which would need to be painfully merged before it could be merged to master.

audiodude avatar May 06 '20 05:05 audiodude

Thanks @audiodude. @mlissner, care to chime in?

arderyp avatar May 06 '20 05:05 arderyp

Sorry, what was the "spawning multiple downloaders" issue, @audiodude?

(I think it shouldn't be too rotten. I bet it'd probably rebase cleanly.)

mlissner avatar May 06 '20 22:05 mlissner

Basically the problem is that we crawl multiple pages in order to deliver the final lists of case_names, case_dates, download_urls etc.

I think there might have been a misunderstanding here: the list of URLs has every binary PDF once, and they're not requested by the scraper either multiple times or even once.

What does get requested multiple times are the index pages, or the inner iterating pages that list all of the opinions. In all cases I can think of, the scrapers are already set up with self.backward_days = N to avoid scraping the entirety of the site, as well.

audiodude avatar May 06 '20 23:05 audiodude

can we not just let _download retain its function of making the initial request, to get the landing page, and then create some new function (_download_sub_page or something) that handles iterative requests from there on, which could be overridden on child scrapers if need be?

arderyp avatar May 07 '20 04:05 arderyp

I'm fine with a simpler design. I just want to make sure that from the caller's perspective, it can easily check if a page has changed without having to download any subpages. There's really three steps to some of these, right:

  1. Download the landing page
  2. Follow links on it to sub pages
  3. Download binaries from there

I want to be able to do only number 1 without doing 2 or 3 unless number 1 has changed. Right now that works via the DeferredList thing/monstrosity, and I'm happy to do it another (smarter) way that's less confusing. The DeferredList was my solution to this many years ago, and it has always felt kind of frightening.

mlissner avatar May 07 '20 06:05 mlissner

I think that many times with these sites, the landing page is just a search box and the scraper generates a search query in order to get to the sub pages. So you don't know until you've fetched the sub pages if they are the same.

I'm also confused about (3). I don't see anywhere that scrapers download PDFs, they simply return links to them in _get_download_urls. And presumably that's when CL compares the URLs to what it already has and decides not to download, in DeferredList or whatever.

So what I'm saying is that I think the current implementations of OpinionSiteAspx are fine.

audiodude avatar May 07 '20 14:05 audiodude

Also, as another point of conversation, the original kan site got refactored into kan_p and kan_u after this PR was in progress. It's the last instance of OpinionSiteLinearWebDriven, but I propose leaving it for now and cleaning it up in a follow up PR.

audiodude avatar May 07 '20 14:05 audiodude

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

:white_check_mark: flooie
:x: audiodude
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Dec 18 '21 00:12 CLAassistant

This is kinda nice, maybe we should revive this PR or integrate the - set local variables

flooie avatar Jan 15 '23 00:01 flooie

maybe we should revive this PR

Definitely not a priority for FLP staff to do this work, but less Selenium is usually more happiness.

mlissner avatar Jan 19 '23 17:01 mlissner

I can take a look at rebasing this to HEAD. Have any new Selenium/ASPX scrapers been added since Dec 2021?

audiodude avatar Jan 19 '23 19:01 audiodude

Looks like all the sites I updated to remove Selenium are using OpinionSiteLinear now and not using Selenium

audiodude avatar Jan 19 '23 19:01 audiodude

Huh, so...shall we close this and open a new issue to remove Selenium entirely?

mlissner avatar Jan 19 '23 20:01 mlissner

@mlissner - unfortunately, we can't currently remove selenium entirely. We are stuck with it for Alabama and Colorado due to the PDF issue. But I agree, I like the ASPX code.

flooie avatar Jan 19 '23 20:01 flooie