crawler
Can you make it an external DSL?
For example, a file that contains the DSL, which the app can run to return the result.
That's not really a use case that I have, as what I am doing requires that the crawler DSL work alongside the rest of my Scala app. I'm certainly willing to accept any pull requests though!
I'm also interested and looking for guidance: What is the most atomic expression to feed into a Crawler instance? After each consumed expression, it should be possible to inspect the nodeStack.
My initial guess is simply a `(Function1[Unit, => Unit], ElementProcessor)` pair, but then I'm still confused by the required stack state and the varying types of the entry points (`in`, `from`, `forAll`).
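For discussion's sake, one way to picture the "most atomic expression" is a single stack transition: a function from the current node stack to a new node stack, which can be inspected after every step. This is only a hypothetical model to frame the question; the names `CrawlStep`, `NodeStack`, `in`, and `back` below are invented and do not match the library's actual types:

```scala
object AtomicStepSketch {
  // Hypothetical stand-ins: the real crawler keeps a stack of HtmlUnit nodes.
  type Node      = String
  type NodeStack = List[Node]

  // One "atomic expression" = one stack transition, inspectable after each run.
  final case class CrawlStep(run: NodeStack => NodeStack)

  // Entry points would then just push or pop nodes on the stack:
  def in(child: Node): CrawlStep = CrawlStep(stack => child :: stack)
  def back: CrawlStep            = CrawlStep(stack => stack.drop(1))

  // Feed steps in one at a time; the intermediate stacks are all observable.
  def runAll(steps: List[CrawlStep], start: NodeStack): NodeStack =
    steps.foldLeft(start)((stack, step) => step.run(stack))
}
```

Under this model, inspecting the `nodeStack` between expressions is just looking at each intermediate fold result.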
Meanwhile, I bumped the version of HtmlUnit and made the crawler work for 2.21.
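For anyone wanting to reproduce the bump locally, it should amount to updating the dependency in `build.sbt` (the coordinates below are HtmlUnit's standard Maven ones; adjust if the project pins them differently):

```scala
// build.sbt – bump the HtmlUnit dependency to the version mentioned above
libraryDependencies += "net.sourceforge.htmlunit" % "htmlunit" % "2.21"
```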
@ahirner Thanks for bumping the HtmlUnit version. Out of curiosity, are you able to get the crawler to work with JavaScript-heavy pages that use, e.g., AngularJS or other libraries to render the front end?
The various entry points are indeed confusing, and I would probably not choose to write that code the same way if I were starting over on this project again today. Sorry for the uninformative response, it has long been my desire to start this codebase over again now that I have a couple more years of Scala experience under my belt...
@bplawler the code is highly educational, e.g. how to handle the explicit type conversions that HtmlUnit requires. Thanks! For my use case, and up until now, it has handled quirky JS pages just fine. This includes an SPA that streams the DOM in a quite old-fashioned way. In order to scrape such cases, I first injected basic JS query functions that are meant to be used uniformly across sites. I haven't yet tested the Rhino/HtmlUnit combo with bleeding-edge or more heavyweight frontends.
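A minimal sketch of that injection approach, assuming HtmlUnit's `WebClient`/`HtmlPage` API (the helper name `__q` and the `withHelpers` wrapper are my own inventions, not part of the crawler library):

```scala
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.html.HtmlPage

object InjectHelpers {
  // Hypothetical uniform query helper, injected before any site-specific scraping
  // so every site can be queried with the same JS entry point.
  val helperJs: String =
    """window.__q = function(sel) {
      |  return Array.prototype.slice.call(document.querySelectorAll(sel));
      |};""".stripMargin

  // Load a page and inject the shared helpers into its JS context.
  def withHelpers(url: String): HtmlPage = {
    val client = new WebClient()
    // Quirky pages often throw script errors; don't let them abort the load.
    client.getOptions.setThrowExceptionOnScriptError(false)
    val page: HtmlPage = client.getPage[HtmlPage](url)
    page.executeJavaScript(helperJs) // inject once per page
    page
  }
}
```

The idea is simply that scraping code downstream calls `__q(...)` instead of site-specific DOM walks, so the per-site logic stays uniform.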