User-agent switching doesn't work with Proxy

Open soanni opened this issue 5 years ago • 6 comments

Hi there, thank you for your amazing work, it's really a great framework! I have been scratching my head for several days but can't understand what's wrong. I'm using an HTTP forward proxy (Squid) together with the User-Agent switching from the Colly extensions, but in the Squid logs I can see that the User-Agent header is the default Golang-user-agent-1.1. However, in the OnRequest hook I can see that the User-Agent switching does happen; moreover, the User-Agent is still the custom one even inside the http_backend.go Do() method.

func (h *httpBackend) Do(request *http.Request, bodySize int) (*Response, error) {
	fmt.Println(request) // I'm logging to check the User-Agent is custom
	res, err := h.Client.Do(request)

But on the Squid side I see 'Golang-user-agent-1.1' for every request. I suspect that something happens to the User-Agent header on the net/http side, particularly when a proxy is used. The code is below (I also tried the ProxySwitcher extension, but still no luck).

	c := colly.NewCollector(
		colly.Debugger(&debug.LogDebugger{}),
		colly.Async(true),
	)
	c.WithTransport(&http.Transport{
		Proxy: func(pr *http.Request) (*url.URL, error){
			parsedU, err := url.Parse(viper.GetString("squid"))
			if err != nil {
				return nil, err
			}
			return parsedU, nil
		},
		DisableKeepAlives: true,
	})

	extensions.RandomUserAgent(c)

	c.OnRequest(func(r *colly.Request) {
		log.Println("Visiting", r.URL)
		log.Println("UserAgent", r.Headers.Get("User-Agent"))
	})

soanni, Sep 29 '19 14:09

Is the project dead?

soanni, Oct 20 '19 11:10

Hmm.. Interesting, do you get the same result if you set the proxy with Collector.SetProxy()?

asciimoo, Jan 13 '20 02:01

> Hmm.. Interesting, do you get the same result if you set the proxy with Collector.SetProxy()?

Yes.

package main
import (
	"fmt"
	"log"
	"net/http"
	"github.com/gocolly/colly"
)

func main() {
	url := "https://httpbin.org/ip"
	c := colly.NewCollector(
		colly.AllowURLRevisit(),
		colly.Async(true),
	)
	c.UserAgent = "curl/7.54.0"
	c.WithTransport(&http.Transport{ 
		DisableKeepAlives: true, 
	})

	c.OnRequest(func(r *colly.Request) {
		proxy := r.Ctx.Get("proxy")
		c.SetProxy(proxy) // does not work with colly.Async(true)
		log.Println("OnRequest proxy:", proxy)
	})
	c.OnResponse(func(r *colly.Response) {
		log.Println("OnResponse")
		// log.Println("r.Request.ProxyURL", r.Request.ProxyURL) 
		// log.Println("OnResponse Visited", r.Request.URL)

		log.Println(string(r.Body[:]))
		proxy := r.Ctx.Get("proxy") // always the same one: the last proxy
		fmt.Println("OnResponse proxy:", proxy)
		fmt.Println("------------")
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Println("OnError ", r.StatusCode, err)
		proxy := r.Ctx.Get("proxy")
		fmt.Println("OnError proxy:", proxy)

		fmt.Println("------------")
	})

	for idx, proxy := range proxy_list {
		fmt.Println(idx, proxy)
		var ctx = colly.NewContext()
		ctx.Put("proxy", proxy)
		c.Request("GET", url, nil, ctx, nil) 
	}
	c.Wait()
}

makelove, Feb 24 '20 12:02

I think colly's design has a problem here.

Why can't we set a proxy on each individual request, as in Scrapy? That is very easy to use.
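
For reference, plain net/http can already pick a proxy per request, because Transport.Proxy is called with the outgoing request. Below is a minimal sketch using only the standard library, not colly's API: the proxy address travels in the request context and a custom Proxy function reads it back. The names proxyKey, withProxy and proxyFromContext, and the two proxy addresses, are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

// proxyKey is a private context key type (hypothetical name).
type proxyKey struct{}

// withProxy returns a copy of req whose context carries the proxy address.
func withProxy(req *http.Request, proxyAddr string) *http.Request {
	return req.WithContext(context.WithValue(req.Context(), proxyKey{}, proxyAddr))
}

// proxyFromContext has the signature of Transport.Proxy. It returns the proxy
// stored in the request context, or nil (direct connection) when none is set.
func proxyFromContext(req *http.Request) (*url.URL, error) {
	addr, ok := req.Context().Value(proxyKey{}).(string)
	if !ok || addr == "" {
		return nil, nil
	}
	return url.Parse(addr)
}

func main() {
	transport := &http.Transport{Proxy: proxyFromContext}

	// Placeholder proxy addresses; nothing is dialed in this sketch.
	for _, addr := range []string{"http://127.0.0.1:3128", "http://127.0.0.1:3129"} {
		req, err := http.NewRequest(http.MethodGet, "https://httpbin.org/ip", nil)
		if err != nil {
			log.Fatal(err)
		}
		req = withProxy(req, addr)

		// Call the selector directly to show which proxy would be used.
		chosen, err := transport.Proxy(req)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("request would go through:", chosen)
	}
}
```

Since the choice is carried by each request rather than by shared collector state, concurrent requests cannot overwrite each other's proxy, which is the problem with calling SetProxy from OnRequest in async mode. Whether colly exposes a way to attach such a context per request is not something this sketch assumes.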

makelove, Feb 24 '20 12:02

@makelove good idea, would you like to work on it?

asciimoo, Feb 26 '20 15:02

User-Agent and proxy switching work fine for me with the setup below. I did have some trouble getting it working, though: I cannot get proxy rotation to work without DisableKeepAlives set to true. Would it be worth updating the documentation for this?

type httpBin struct {
	Headers struct {
		UserAgent string `json:"User-Agent"`
	} `json:"headers"`
	Origin string `json:"origin"`
}

func main() {
	// Instantiate the collector
	c := colly.NewCollector(

		// apply collector settings
		colly.AllowURLRevisit(),
		colly.Async(true), // testing async settings
	)

	// add random user agent extension
	extensions.RandomUserAgent(c)

	// load proxies into round robin switcher
	rp, err := proxy.RoundRobinProxySwitcher(proxies.GetAll()...) // list of proxy strings
	if err != nil {
		log.Fatal(err)
	}

	// if using async then disable transport keep alives
	c.WithTransport(&http.Transport{
		Proxy:             rp,
		DisableKeepAlives: true, // must be true
	})

	// Print the response
	c.OnResponse(func(r *colly.Response) {
		obj := httpBin{}
		err := json.Unmarshal(r.Body, &obj)
		if err != nil {
			log.Fatal(err)
		}

		fmt.Printf("%s: %s\n", obj.Origin, obj.Headers.UserAgent)
	})

	// create a request queue with 2 consumer threads
	q, err := queue.New(
		2, // Number of consumer threads
		&queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
	)
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 100; i++ {
		// Add URLs to the queue
		q.AddURL("https://httpbin.org/get")
	}
	// Consume URLs
	q.Run(c)

	// wait for the async requests to finish
	c.Wait()
}

littlecluster, Apr 08 '21 14:04