colly
User-agent switching doesn't work with Proxy
Hi there, thank you for your amazing work, it's a really great framework! I've been scratching my head for several days but can't figure out what's wrong. I'm using an HTTP forward proxy (Squid) together with the User-Agent switching from the Colly extensions, but in the Squid logs I can see that the User-Agent header is the default Golang-user-agent-1.1. However, in the OnRequest hook I can see that the User-Agent switching does happen; moreover, the User-Agent is still the custom one even inside the Do() method in http_backend.go:
func (h *httpBackend) Do(request *http.Request, bodySize int) (*Response, error) {
	fmt.Println(request) // logged to check that the User-Agent is the custom one
	res, err := h.Client.Do(request)
	// ...
But on the Squid side I see 'Golang-user-agent-1.1' for every request. I suspect that something happens to the User-Agent header on the net/http side, particularly when a proxy is used. The code is below (I also tried the ProxySwitcher extension, but still no luck):
c := colly.NewCollector(
	colly.Debugger(&debug.LogDebugger{}),
	colly.Async(true),
)
c.WithTransport(&http.Transport{
	// Send every request through the Squid proxy configured under the "squid" key.
	Proxy: func(pr *http.Request) (*url.URL, error) {
		parsedU, err := url.Parse(viper.GetString("squid"))
		if err != nil {
			return nil, err
		}
		return parsedU, nil
	},
	DisableKeepAlives: true,
})
extensions.RandomUserAgent(c)
c.OnRequest(func(r *colly.Request) {
	log.Println("Visiting", r.URL)
	log.Println("UserAgent", r.Headers.Get("User-Agent"))
})
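One possible explanation, offered as an assumption rather than a confirmed diagnosis of this setup: when the target URL is HTTPS, a forward proxy only sees the CONNECT request that net/http builds itself, and that request does not carry the headers set on the original request. The stdlib-only sketch below uses a hypothetical in-process recording proxy (not Squid) to check which User-Agent arrives on CONNECT, and how `http.Transport.ProxyConnectHeader` changes it. The helper name `connectUserAgent` and the target `https://example.invalid/` are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// connectUserAgent sends one HTTPS request through a throwaway local proxy and
// returns the User-Agent that the proxy saw on the CONNECT request.
// requestUA is set on the request itself; connectUA, if non-empty, is set via
// Transport.ProxyConnectHeader. Both names are hypothetical helpers.
func connectUserAgent(requestUA, connectUA string) (string, error) {
	seen := make(chan string, 1)
	proxy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodConnect {
			select {
			case seen <- r.Header.Get("User-Agent"):
			default:
			}
		}
		// Refuse the tunnel: only the headers that reached the proxy matter here.
		http.Error(w, "tunnel refused by demo proxy", http.StatusBadGateway)
	}))
	defer proxy.Close()

	proxyURL, err := url.Parse(proxy.URL)
	if err != nil {
		return "", err
	}
	transport := &http.Transport{
		Proxy:             http.ProxyURL(proxyURL),
		DisableKeepAlives: true,
	}
	if connectUA != "" {
		transport.ProxyConnectHeader = http.Header{"User-Agent": []string{connectUA}}
	}
	client := &http.Client{Transport: transport, Timeout: 5 * time.Second}

	req, err := http.NewRequest(http.MethodGet, "https://example.invalid/", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", requestUA)

	// The request is expected to fail because the demo proxy refuses the tunnel.
	if resp, err := client.Do(req); err == nil {
		resp.Body.Close()
	}

	select {
	case ua := <-seen:
		return ua, nil
	default:
		return "", errors.New("proxy saw no CONNECT request")
	}
}

func main() {
	ua, err := connectUserAgent("my-crawler/1.0", "")
	fmt.Println("CONNECT User-Agent without ProxyConnectHeader:", ua, err)

	ua, err = connectUserAgent("my-crawler/1.0", "my-crawler/1.0")
	fmt.Println("CONNECT User-Agent with ProxyConnectHeader:", ua, err)
}
```

If this is what is happening here, the target site would still receive the custom User-Agent inside the encrypted tunnel, and only the proxy log would show the default one. For plain http:// targets the proxy sees the request's own headers, so this sketch would not explain the behaviour there.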
Is the project dead?
Hmm, interesting. Do you get the same result if you set the proxy with Collector.SetProxy()?
yes
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/gocolly/colly"
)

// proxyList is not defined in the original post; fill in your own proxy URLs.
var proxyList []string

func main() {
	url := "https://httpbin.org/ip"
	c := colly.NewCollector(
		colly.AllowURLRevisit(),
		colly.Async(true),
	)
	c.UserAgent = "curl/7.54.0"
	c.WithTransport(&http.Transport{
		DisableKeepAlives: true,
	})
	c.OnRequest(func(r *colly.Request) {
		proxy := r.Ctx.Get("proxy")
		// Not working with colly.Async(true): SetProxy applies to the whole
		// collector, not to this single request.
		c.SetProxy(proxy)
		log.Println("OnRequest proxy:", proxy)
	})
	c.OnResponse(func(r *colly.Response) {
		log.Println("OnResponse")
		// log.Println("r.Request.ProxyURL", r.Request.ProxyURL)
		// log.Println("OnResponse Visited", r.Request.URL)
		log.Println(string(r.Body))
		proxy := r.Ctx.Get("proxy") // always the same one: the last proxy
		fmt.Println("OnResponse proxy:", proxy)
		fmt.Println("------------")
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Println("OnError ", r.StatusCode, err)
		proxy := r.Ctx.Get("proxy")
		fmt.Println("OnError proxy:", proxy)
		fmt.Println("------------")
	})
	for idx, proxy := range proxyList {
		fmt.Println(idx, proxy)
		// Attach the proxy to the request context so the callbacks can read it.
		ctx := colly.NewContext()
		ctx.Put("proxy", proxy)
		if err := c.Request("GET", url, nil, ctx, nil); err != nil {
			log.Println("Request error:", err)
		}
	}
	c.Wait()
}
I think the design of colly has a problem here.
Why can't a proxy be set on every single Request, like in Scrapy? That is very easy to use.
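To make the request above concrete, here is a minimal sketch of per-request proxy selection in plain net/http, where the proxy travels in the request's context and `http.Transport.Proxy` reads it back. This is not a Colly API; `proxyKey`, `withProxy`, `proxyFromContext` and `chosenProxy` are hypothetical names, the proxy addresses are placeholders, and wiring this into a collector is left open.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
)

// proxyKey is an unexported context key type, so it cannot collide with other keys.
type proxyKey struct{}

// withProxy returns a copy of req whose context carries the proxy URL to use.
func withProxy(req *http.Request, rawProxyURL string) *http.Request {
	ctx := context.WithValue(req.Context(), proxyKey{}, rawProxyURL)
	return req.WithContext(ctx)
}

// proxyFromContext has the signature expected by http.Transport.Proxy.
// It returns the proxy stored in the request context, or (nil, nil) to
// connect directly when none was set.
func proxyFromContext(req *http.Request) (*url.URL, error) {
	raw, ok := req.Context().Value(proxyKey{}).(string)
	if !ok || raw == "" {
		return nil, nil
	}
	return url.Parse(raw)
}

// chosenProxy builds a request for target, optionally tags it with a proxy,
// and reports which proxy the transport would be told to use ("" = direct).
func chosenProxy(target, rawProxyURL string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	if rawProxyURL != "" {
		req = withProxy(req, rawProxyURL)
	}
	u, err := proxyFromContext(req)
	if err != nil || u == nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	// A client built this way consults proxyFromContext for every request.
	client := &http.Client{Transport: &http.Transport{Proxy: proxyFromContext}}
	_ = client // no network calls are made in this sketch

	for _, p := range []string{"http://127.0.0.1:3128", "http://127.0.0.1:3129", ""} {
		got, err := chosenProxy("https://httpbin.org/ip", p)
		fmt.Printf("requested proxy %q, transport would use %q (err: %v)\n", p, got, err)
	}
}
```

Because the choice lives in each request rather than in shared collector state, concurrent requests cannot overwrite one another's proxy.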
@makelove good idea, would you like to work on it?
User-Agent and proxy switching are working fine for me with the setup below. I did have some trouble getting this to work, though: I cannot get proxy rotation to work without DisableKeepAlives set to true. Would it be worth updating the documentation for this?
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
	"github.com/gocolly/colly/proxy"
	"github.com/gocolly/colly/queue"
	// "proxies" is the poster's own package; its import path is not shown.
)

// httpBin holds the fields of the httpbin.org/get response that we print.
type httpBin struct {
	Headers struct {
		UserAgent string `json:"User-Agent"`
	} `json:"headers"`
	Origin string `json:"origin"`
}

func main() {
	// Instantiate the collector
	c := colly.NewCollector(
		// apply collector settings
		colly.AllowURLRevisit(),
		colly.Async(true), // testing async settings
	)
	// add the random User-Agent extension
	extensions.RandomUserAgent(c)
	// load proxies into the round-robin switcher
	rp, err := proxy.RoundRobinProxySwitcher(proxies.GetAll()...) // list of proxy strings
	if err != nil {
		log.Fatal(err)
	}
	// if using async, disable transport keep-alives
	c.WithTransport(&http.Transport{
		Proxy:             rp,
		DisableKeepAlives: true, // must be true
	})
	// Print the origin IP and User-Agent reported by httpbin
	c.OnResponse(func(r *colly.Response) {
		obj := httpBin{}
		err := json.Unmarshal(r.Body, &obj)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %s\n", obj.Origin, obj.Headers.UserAgent)
	})
	// create a request queue with 2 consumer threads
	q, err := queue.New(
		2, // Number of consumer threads
		&queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
	)
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; i < 100; i++ {
		// Add URLs to the queue
		q.AddURL("https://httpbin.org/get")
	}
	// Consume URLs
	if err := q.Run(c); err != nil {
		log.Fatal(err)
	}
	// wait for the async requests to finish
	c.Wait()
}
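For readers wondering what the rotation in the setup above amounts to, here is a minimal stdlib-only sketch of a round-robin proxy function. It illustrates the general technique under my own assumptions and is not Colly's actual implementation of `proxy.RoundRobinProxySwitcher`; `roundRobin` and `rotation` are hypothetical names and the proxy addresses are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobin parses the given proxy URLs and returns a function with the
// signature expected by http.Transport.Proxy. Each call hands out the next
// proxy in the list, wrapping around at the end. The counter is updated
// atomically so the function is safe to call from concurrent requests.
func roundRobin(rawURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, errors.New("proxy list is empty")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		proxies = append(proxies, u)
	}
	var next uint32
	return func(*http.Request) (*url.URL, error) {
		i := atomic.AddUint32(&next, 1) - 1
		return proxies[i%uint32(len(proxies))], nil
	}, nil
}

// rotation reports which proxies would be picked for n consecutive requests.
func rotation(n int, rawURLs ...string) ([]string, error) {
	pick, err := roundRobin(rawURLs...)
	if err != nil {
		return nil, err
	}
	picked := make([]string, 0, n)
	for i := 0; i < n; i++ {
		u, err := pick(nil) // the sketch ignores the request itself
		if err != nil {
			return nil, err
		}
		picked = append(picked, u.String())
	}
	return picked, nil
}

func main() {
	picked, err := rotation(5, "http://127.0.0.1:8001", "http://127.0.0.1:8002")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, p := range picked {
		fmt.Printf("request %d -> %s\n", i, p)
	}
}
```

A function like this can be assigned to the `Proxy` field of an `http.Transport`, in the same position as `rp` in the post above.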