
Feature Request: JavaScript execution

Open vosmith opened this issue 7 years ago • 30 comments

Any plan to add a JavaScript engine to this framework? A few projects that might help are:

https://github.com/robertkrimen/otto
https://github.com/lazytiger/go-v8
https://github.com/dop251/goja

I'm not sure if this aligns with your goals for the project, but it could be almost necessary on a lot of modern websites.

vosmith avatar Oct 06 '17 00:10 vosmith

I'm not planning to add remote code execution to Colly. Headless browsers are perfect for this purpose.

Perhaps it would make the framework a bit easier to use in some cases, but it would also introduce big computational overhead and the scraping jobs would be less reliable.

JS-only sites are smoothly scrapeable with static scraping frameworks too.

asciimoo avatar Oct 06 '17 11:10 asciimoo

JS-only sites are smoothly scrapeable with static scraping frameworks too

I think it would be very useful to see an example of this using Colly!

ilyaglow avatar Nov 20 '17 08:11 ilyaglow

Here is a detailed example: http://go-colly.org/articles/how_to_scrape_instagram/ (https://github.com/gocolly/colly/tree/master/_examples/instagram)
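
For readers who don't want to click through: the core trick in that example is that the data a JS-heavy page renders usually comes from a JSON endpoint which can be requested directly. A minimal sketch of that pattern with Colly (the endpoint URL and field names below are hypothetical, not Instagram's actual API):

package main

import (
    "encoding/json"
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

// apiResponse models a hypothetical JSON payload returned by the site's API.
type apiResponse struct {
    Items []struct {
        Title string `json:"title"`
        URL   string `json:"url"`
    } `json:"items"`
}

func main() {
    c := colly.NewCollector()

    // The JSON endpoint is not HTML, so handle it in OnResponse.
    c.OnResponse(func(r *colly.Response) {
        var data apiResponse
        if err := json.Unmarshal(r.Body, &data); err != nil {
            log.Println("decode error:", err)
            return
        }
        for _, item := range data.Items {
            fmt.Println(item.Title, item.URL)
        }
    })

    // Hypothetical API endpoint that the JS front end calls.
    if err := c.Visit("https://example.com/api/items?page=1"); err != nil {
        log.Fatal(err)
    }
}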

asciimoo avatar Nov 20 '17 09:11 asciimoo

@asciimoo that is just not true for all sites though. I understand that implementing a headless browser is a tremendous task; maybe there is some way to add a Selenium implementation as an extension?

manugarri avatar May 14 '18 08:05 manugarri

@manugarri I like the idea of an extension which can communicate with a headless browser. Would you like to implement it?

asciimoo avatar May 14 '18 08:05 asciimoo

@asciimoo would it make sense to add an already existing implementation? I haven't checked Colly's code in depth, so I'm not sure if any of the Golang implementations would work (for example this, this or this).

How would the implementation fit? In terms of project structure, do you have any specific ideas in mind? I don't mind doing the grunt work when I have the time.

manugarri avatar May 14 '18 08:05 manugarri

@manugarri I don't have experience in this field, I'm open to any ideas. Probably, https://github.com/scrapy-plugins/scrapy-splash is a good base to study a working solution.

asciimoo avatar May 14 '18 08:05 asciimoo

Hmm, scrapy-splash works by sending requests to a Splash server that you usually run with Docker. I don't think that implementation would make sense here, right?

It would be easier to implement (Colly would just need to make GET requests to the Splash server to get the rendered JS), but it would mean the server running Colly needs either Docker or Python installed, which is a pretty big overhead IMHO.
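
For reference, the Splash approach being discussed would only amount to rewriting the visited URL, something like the sketch below (assuming a Splash instance listening on localhost:8050 and its documented render.html endpoint; the target page is hypothetical):

package main

import (
    "fmt"
    "log"
    "net/url"

    "github.com/gocolly/colly"
)

func main() {
    target := "https://example.com/" // hypothetical JS-heavy page

    c := colly.NewCollector()

    // The HTML returned by Splash is already JS-rendered, so the usual
    // selector-based callbacks work on it.
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("rendered title:", e.Text)
    })

    // Ask Splash to render the page and return the resulting HTML.
    splashURL := "http://localhost:8050/render.html?wait=1&url=" + url.QueryEscape(target)
    if err := c.Visit(splashURL); err != nil {
        log.Fatal(err)
    }
}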

manugarri avatar May 14 '18 08:05 manugarri

I think the easiest way would be to have a separate repo named colly-contrib or similar, where you could define extensions similar to the extensions package, but that would be optional. That way whoever wants to scrape JS can do it without forcing everyone to install it (it's usually an overhead, since you have to download the browser driver separately).

manugarri avatar May 14 '18 08:05 manugarri

An extension that uses Headless Chrome?

LeMoussel avatar May 14 '18 12:05 LeMoussel

Either that or geckodriver, yeah

manugarri avatar May 15 '18 07:05 manugarri

I also think it's a good idea to make some kind of plugin system, perhaps by converting the http_backend into some sort of interface. As for Headless Chrome vs. geckodriver/Servo, as long as we target the W3C WebDriver spec, it should open things up to a lot of possibilities.

As @manugarri mentioned, there are many Golang implementations of the WebDriver client, but I don't have any personal experience with them.

Thoughts?
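
To make the idea concrete, a hypothetical backend interface along those lines could be as small as the sketch below (the HTTPDriver name and method signature are illustrative, not Colly's actual API):

package driver

import "net/http"

// HTTPDriver is a hypothetical abstraction over "something that can fetch a
// page": the default net/http based backend, or a headless-browser backend
// that executes JavaScript before handing back the response body.
type HTTPDriver interface {
    // Do performs the request and returns the (possibly JS-rendered) response.
    Do(req *http.Request) (*http.Response, error)
}

// netHTTPDriver is the trivial implementation backed by net/http.
type netHTTPDriver struct {
    client *http.Client
}

func (d *netHTTPDriver) Do(req *http.Request) (*http.Response, error) {
    return d.client.Do(req)
}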

vosmith avatar May 16 '18 01:05 vosmith

I think chromedp is a solution worth studying.

Package chromedp is a faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using the Chrome Debugging Protocol.

LeMoussel avatar May 16 '18 05:05 LeMoussel

@LeMoussel looks like a good solution that supports multiple browsers. @asciimoo, should this issue be re-opened?

vosmith avatar May 16 '18 17:05 vosmith

@asciimoo, @manugarri, @LeMoussel: I've opened PR #148 in an attempt to create an HTTPDriver interface, opening up the possibility of using external browsers. When you get some time, please review it.

Thanks!

vosmith avatar May 18 '18 16:05 vosmith

Hi @vosmith, can you maybe give an example of how to use this HTTPDriver? Regards.

alejoloaiza avatar May 19 '18 15:05 alejoloaiza

@alejoloaiza the first implementation of the HTTPDriver interface is in design right now. We literally just started the PR yesterday so it hasn't been merged into master yet. You can check out the work on #148 and if you want to add your own implementations, feel free to pull my branch and play around.

vosmith avatar May 19 '18 21:05 vosmith

@LeMoussel I was checking out chromedp, and it actually opens the Chrome browser and does the automation in it. Headless Chrome, on the other hand, does everything in the background.

alejoloaiza avatar May 21 '18 19:05 alejoloaiza

@alejoloaiza is there any way to run chromedp headless? We would need a headless implementation for this project.

manugarri avatar May 21 '18 21:05 manugarri

@manugarri There is a way, using the flag runner.Flag("headless", true).

Some issues with it have been reported here; however, I will test more and let you know.

alejoloaiza avatar May 21 '18 22:05 alejoloaiza

Hi @manugarri, I have done quite a bit of testing with chromedp. I built a flight scraper for Skyscanner using it, which you can check here, but I have to tell you that when I run it headless, using runner.Flag("headless", true), it doesn't work. So headless changes the behavior completely. An additional comment: many websites have checks for headless browsers, since the user agent changes. I will try to test on other websites as well, because I have only been testing there these past days.

alejoloaiza avatar May 26 '18 15:05 alejoloaiza

@alejoloaiza, you can change the UserAgent with the runner options. Look for the UserAgent option at https://github.com/chromedp/chromedp/blob/e57a331e5c3c3b51ba749c196f092966b9ae233e/runner/runner.go#L393

For example:

cdp.New(ctxt, cdp.WithRunnerOptions(
    runner.UserAgent("<your User Agent>"),
))
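
Combining that with the headless flag mentioned earlier, a rough sketch of a full fetch with the runner-based chromedp API of that era might look like the following (the option calls are taken from this thread; the surrounding setup and task names are my assumption and should be checked against the chromedp docs):

package main

import (
    "context"
    "log"

    cdp "github.com/chromedp/chromedp"
    "github.com/chromedp/chromedp/runner"
)

func main() {
    ctxt, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Launch Chrome headless with a custom user agent.
    c, err := cdp.New(ctxt, cdp.WithRunnerOptions(
        runner.Flag("headless", true),
        runner.UserAgent("<your User Agent>"),
    ))
    if err != nil {
        log.Fatal(err)
    }

    // Navigate and grab the rendered HTML.
    var html string
    err = c.Run(ctxt, cdp.Tasks{
        cdp.Navigate("https://example.com/"),
        cdp.OuterHTML("html", &html),
    })
    if err != nil {
        log.Fatal(err)
    }

    // Shut down Chrome cleanly.
    if err := c.Shutdown(ctxt); err != nil {
        log.Fatal(err)
    }
    if err := c.Wait(); err != nil {
        log.Fatal(err)
    }

    log.Printf("got %d bytes of rendered HTML", len(html))
}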

LeMoussel avatar May 27 '18 06:05 LeMoussel

@alejoloaiza awesome, will check it out. Like @LeMoussel said, I think you can probably change the chromedp user agent while running headless, right? If not, then I would say chromedp is not the right solution for scraping.

manugarri avatar May 27 '18 08:05 manugarri

What's the status on that? Any experimental branches?

sp3c1 avatar Jul 23 '18 14:07 sp3c1

@sp3c1 I have done some playing around in a branch at vosmith/colly:http_backend. That branch has some architectural changes that open up the http_backend as an interface. I also have a project at vosmith/colly_cdpdriver that works with the Google Chrome browser only and runs in the foreground.

I haven't been able to make much progress on it with some of my other priorities, but FWIW it is functional.

vosmith avatar Jul 30 '18 12:07 vosmith

Because this does not support JS, I switched to github.com/MontFerret/ferret.

Might help others, and it might also help Colly to look at it to get this functionality.

ghost avatar Oct 08 '18 12:10 ghost

Guys, is there any progress on this?

danaki avatar Aug 04 '19 21:08 danaki

Colly isn't outdated. It is a static scraper/crawler framework; it is not a headless browser, nor a wrapper around a headless browser. If you have to render pages, then use a headless browser, but JS-only sites usually use a JSON/REST API, which can be handled without scraping. Also, I'm still open to adding rendering features to the Colly ecosystem, but only as an independent extension, as I wrote in an older comment. Headless browsers can render pages, but at the network level they are much less configurable than Colly. Both solutions have their own use cases.

asciimoo avatar May 19 '20 19:05 asciimoo

@asciimoo What about adding a colly.JSVisit function that loads a page and returns the HTML of the JS-rendered page?

That can be done using one of the headless clients. At this link I found a function:

import (
    "context"
    "log"
    "time"

    "github.com/chromedp/chromedp"
)

// GetHttpHtmlContent loads url in headless Chrome, waits until selector is
// visible and returns the outer HTML of the node addressed by sel (a JS path).
func GetHttpHtmlContent(url string, selector string, sel interface{}) (string, error) {
    options := []chromedp.ExecAllocatorOption{
        chromedp.Flag("headless", true), // set to false to debug with a visible browser
        chromedp.Flag("blink-settings", "imagesEnabled=false"),
        chromedp.UserAgent(`Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36`),
    }
    // Start from the default allocator options and append the custom flags.
    options = append(chromedp.DefaultExecAllocatorOptions[:], options...)

    allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(), options...)
    defer cancelAlloc()

    // Create the browser context.
    chromeCtx, cancelCtx := chromedp.NewContext(allocCtx, chromedp.WithLogf(log.Printf))
    defer cancelCtx()

    // Run an empty task list to start the Chrome instance in advance.
    if err := chromedp.Run(chromeCtx, make([]chromedp.Action, 0, 1)...); err != nil {
        return "", err
    }

    // Limit navigation and extraction to 40 seconds.
    timeoutCtx, cancel := context.WithTimeout(chromeCtx, 40*time.Second)
    defer cancel()

    var htmlContent string
    err := chromedp.Run(timeoutCtx,
        chromedp.Navigate(url),
        chromedp.WaitVisible(selector),
        chromedp.OuterHTML(sel, &htmlContent, chromedp.ByJSPath),
    )
    if err != nil {
        log.Printf("chromedp.Run error: %v", err)
        return "", err
    }

    return htmlContent, nil
}

Maybe with small changes it could be added to Colly as an extension?

For example: gocolly/jsvisit.
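
One possible shape for such an extension, sketched under the assumption that it lives in the same package as the GetHttpHtmlContent function above and that callers are happy to get a goquery document back (the package name and the JSVisit signature are hypothetical):

package jsvisit

import (
    "strings"

    "github.com/PuerkitoBio/goquery"
)

// JSVisit renders url with headless Chrome (via GetHttpHtmlContent above),
// waiting until waitSelector is visible, and returns a goquery document built
// from the JS-rendered HTML so callers can run CSS-selector extraction on it.
func JSVisit(url, waitSelector string) (*goquery.Document, error) {
    // "document.documentElement" is a JS path that selects the whole rendered document.
    html, err := GetHttpHtmlContent(url, waitSelector, "document.documentElement")
    if err != nil {
        return nil, err
    }
    return goquery.NewDocumentFromReader(strings.NewReader(html))
}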

sahakkhotsanyan avatar Dec 11 '21 04:12 sahakkhotsanyan