colly
Feature Request: JavaScript execution
Any plan to add a JavaScript engine to this framework? A few projects that might help are:

- https://github.com/robertkrimen/otto
- https://github.com/lazytiger/go-v8
- https://github.com/dop251/goja
I'm not sure if this aligns with your goals for the project, but it could be almost necessary on a lot of modern websites.
I'm not planning to add remote code execution to Colly. Headless browsers are perfect for this purpose.
Perhaps it would make the framework a bit easier to use in some cases, but it would also introduce a big computational overhead, and the scraping jobs would be less reliable.
JS-only sites are smoothly scrapeable with static scraping frameworks too.
> JS-only sites are smoothly scrapeable with static scraping frameworks too
I think it would be very useful to see an example of this using Colly!
Here is a detailed example: http://go-colly.org/articles/how_to_scrape_instagram/ (https://github.com/gocolly/colly/tree/master/_examples/instagram)
@asciimoo that is just not true for all sites though. I understand that implementing a headless browser is a tremendous task, maybe there is some way to add a selenium implementation as an extension?
@manugarri I like the idea of an extension which can communicate with a headless browser. Would you like to implement it?
@asciimoo would it make sense to add an already existing implementation? I haven't checked Colly's code in depth, so I'm not sure if any of the Golang implementations would work (for example this, this, or this).
How would the implementation fit in terms of project structure? Do you have any specific structure ideas in mind? I don't mind doing the grunt work when I have the time.
@manugarri I don't have experience in this field, I'm open to any ideas. Probably, https://github.com/scrapy-plugins/scrapy-splash is a good base to study a working solution.
Hmm, scrapy-splash works by sending requests to a Splash server, which you usually run with Docker. I don't think that implementation would make sense here, right?
It would be easier to implement (Colly would just need to make GET requests to the Splash server to get the rendered JS), but that would mean the server running Colly would need either Docker or Python installed, which is a pretty big overhead IMHO.
I think the easiest way would be to have a separate repo named colly-contrib or similar, where you could define extensions similar to the extensions package, but that would be optional. That way whoever wants to scrape JS can do it without forcing everyone to install it (it's usually an overhead, since you have to download the browser driver separately).
An extension that uses Headless Chrome?
Either that or geckodriver, yeah
I also think it is a good idea to make some kind of plugin system, perhaps by converting the http_backend into some sort of interface. As for Headless Chrome vs. Gecko Driver/Servo, as long as we target the W3C WebDriver spec, it should open things up to a lot of possibilities. As @manugarri mentioned, there are many Golang implementations of the WebDriver client, but I don't have any personal experience with them.
Thoughts?
I think chromedp is a solution worth studying:
Package chromedp is a faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using the Chrome Debugging Protocol.
@LeMoussel looks like a good solution that supports multiple browsers. @asciimoo , should this issue be re-opened?
@asciimoo , @manugarri , @LeMoussel : I've opened PR #148 in an attempt to create an HTTPDriver interface, to open up the possibility of using external browsers. When you get some time, please review it.
Thanks!
Hi @vosmith , can you maybe give some example of how to use this HTTPDriver? Regards.
@alejoloaiza the first implementation of the HTTPDriver interface is in design right now. We literally just started the PR yesterday so it hasn't been merged into master yet. You can check out the work on #148 and if you want to add your own implementations, feel free to pull my branch and play around.
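For readers wondering what such a seam could look like: here is a hypothetical sketch of an HTTPDriver-style interface. The `Fetch` method set and the driver names are my assumptions for illustration, not the actual API of PR #148.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// HTTPDriver abstracts "turn a URL into a response body": the default driver
// is a plain net/http client, while a headless-browser driver could render
// JavaScript first. The interface shape here is illustrative only.
type HTTPDriver interface {
	Fetch(url string) (io.ReadCloser, error)
}

// StaticDriver is the non-rendering implementation backed by net/http.
type StaticDriver struct{ Client *http.Client }

func (d StaticDriver) Fetch(url string) (io.ReadCloser, error) {
	resp, err := d.Client.Get(url)
	if err != nil {
		return nil, err
	}
	return resp.Body, nil
}

// fakeDriver stands in for a browser-backed driver in this sketch,
// returning pre-"rendered" HTML without any network access.
type fakeDriver struct{ html string }

func (f fakeDriver) Fetch(url string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(f.html)), nil
}

// scrape depends only on the interface, so swapping drivers is transparent.
func scrape(d HTTPDriver, url string) (string, error) {
	body, err := d.Fetch(url)
	if err != nil {
		return "", err
	}
	defer body.Close()
	b, err := io.ReadAll(body)
	return string(b), err
}

func main() {
	out, _ := scrape(fakeDriver{html: "<p>rendered</p>"}, "https://example.com")
	fmt.Println(out) // <p>rendered</p>
}
```

The point of the design is that the collector never needs to know whether the bytes came from a raw HTTP response or a fully rendered DOM.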
@LeMoussel I was checking on chromedp, and it actually opens the Chrome browser and does the automation in it. Headless Chrome, on the other hand, does everything in the background.
@alejoloaiza is there any way to run chromedp headless? We would need a headless implementation for this project.
@manugarri There is a way, using the flag `runner.Flag("headless", true)`. Some issues are reported on this here; however, I will test more and let you know.
Hi @manugarri , I have done several tests of chromedp. I built a flight scraper for Skyscanner using it, which you can check here, but I have to tell you that when I run it headless, using the line `runner.Flag("headless", true)`, it doesn't work. So headless changes the behavior completely. One additional comment: many websites have validations for headless browsers, since the user agent changes. I will try to test on other websites as well, because these days I was only testing there.
@alejoloaiza , you can change the UserAgent with the runner options. You can look for the UserAgent option at https://github.com/chromedp/chromedp/blob/e57a331e5c3c3b51ba749c196f092966b9ae233e/runner/runner.go#L393
For example:

```go
cdp.New(ctxt, cdp.WithRunnerOptions(
	runner.UserAgent("<your User Agent>"),
))
```
@alejoloaiza awesome, will check it out. Like @LeMoussel said, I think you could probably change chromedp's user agent in headless mode, right? If not, then I would say chromedp is not the right solution for scraping.
What's the status on that? Any experimental branches?
@sp3c1 I have done some playing around in a branch at vosmith/colly:http_backend. That branch has some architectural changes that open http_backend up as an interface. I also have a project at vosmith/colly_cdpdriver that works with the Google Chrome browser only and runs in the foreground.
I haven't been able to make much progress on it with some of my other priorities, but FWIW it is functional.
Because this does not support JS, I changed to this: github.com/MontFerret/ferret
It might help others, and it might also help Colly to look at it to get this functionality.
Guys, is there any progress on this?
Colly isn't outdated. It is a static scraper/crawler framework and it is not a headless browser nor a wrapper around a headless browser. If you have to render pages, then use a headless browser, but JS only sites usually use a json/rest api, which can be handled without scraping. Also, I'm still open to add rendering features to the colly ecosystem but only as an independent extension as I wrote in an older comment. Headless browsers can render pages but in the network level they are much less configurable than colly. Both solutions have their own use-cases.
@asciimoo What about adding a colly.JSVisit function that loads a page and returns the HTML of the JS-rendered page?
That can be done using some of the headless clients. At this link I found a function:
```go
import (
	"context"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

// GetHttpHtmlContent gets the data crawled from the website, returning the
// JS-rendered HTML fetched via headless Chrome.
func GetHttpHtmlContent(url string, selector string, sel interface{}) (string, error) {
	options := []chromedp.ExecAllocatorOption{
		chromedp.Flag("headless", true), // set to false to debug in a visible browser
		chromedp.Flag("blink-settings", "imagesEnabled=false"),
		chromedp.UserAgent(`Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36`),
	}
	// Initialization parameters: start from the defaults, then append ours.
	options = append(chromedp.DefaultExecAllocatorOptions[:], options...)

	allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(), options...)
	defer cancelAlloc()

	// Create the browser context.
	chromeCtx, cancelCtx := chromedp.NewContext(allocCtx, chromedp.WithLogf(log.Printf))
	defer cancelCtx()

	// Execute an empty task list to create the Chrome instance in advance.
	if err := chromedp.Run(chromeCtx, make([]chromedp.Action, 0, 1)...); err != nil {
		return "", err
	}

	// Create a context with a timeout of 40s.
	timeoutCtx, cancel := context.WithTimeout(chromeCtx, 40*time.Second)
	defer cancel()

	var htmlContent string
	err := chromedp.Run(timeoutCtx,
		chromedp.Navigate(url),
		chromedp.WaitVisible(selector),
		chromedp.OuterHTML(sel, &htmlContent, chromedp.ByJSPath),
	)
	if err != nil {
		log.Printf("Run err : %v\n", err)
		return "", err
	}
	return htmlContent, nil
}
```
Maybe with small changes it could be added to Colly as an extension? For example, gocolly/jsvisit.