colly icon indicating copy to clipboard operation
colly copied to clipboard

`c.SetRequestTimeout` Doesn't forcefully close request after time limit is exceed

Open JustinTimperio opened this issue 4 years ago • 14 comments

Hi all, I am part of a research project called "Project-Lernaean". We are working on mapping and surveying the internet for both developers and security researchers. We have been using Colly quite successfully in our scans but recently I have run into a fairly obscure issue. My current guess is that honeypots create endless pages that are deadlocking a few jobs infinitely. I have set very aggressive timeouts which you can see below, but even still typically in a job of 10k IP's 4-15 will fully deadlock and c.Wait() never completes.

// Create Colly Collector
       c := colly.NewCollector(
               colly.MaxBodySize(10e9),
               colly.DetectCharset(),
               colly.Async(true),
       )
       c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 500})
       c.SetRequestTimeout(5 * time.Second)
       c.WithTransport(&http.Transport{
               Proxy: http.ProxyFromEnvironment,
               DialContext: (&net.Dialer{
                       Timeout:   5 * time.Second,
                       KeepAlive: 5 * time.Second,
                       DualStack: true,
               }).DialContext,
               MaxIdleConns:          500,
               IdleConnTimeout:       5 * time.Second,
               TLSHandshakeTimeout:   5 * time.Second,
               ExpectContinueTimeout: 1 * time.Second,
       })

I would appreciate any or guidance that the community has on dealing with this issue. I am also happy to create a test repository that can duplicate this issue. If possible adding a c.ForceClose() function would be extremely helpful in the case that workers for whatever reason, are unable to terminate.

JustinTimperio avatar Aug 26 '21 19:08 JustinTimperio

Try to send SIGQUIT (Ctrl+). Go will dump goroutine stacks traces and exit. Stack traces might help figure out where they got stuck.

WGH- avatar Aug 26 '21 20:08 WGH-

^\SIGQUIT: quit
PC=0x46d301 m=0 sigcode=128

goroutine 0 [idle]:
runtime.futex(0xec6450, 0x80, 0x0, 0x0, 0x7ffd00000000, 0x0, 0xadaf50, 0x0, 0x7ffd8e7e4268, 0x40c17f, ...)
        /usr/lib/go/src/runtime/sys_linux_amd64.s:579 +0x21
runtime.futexsleep(0xec6450, 0x0, 0xffffffffffffffff)
        /usr/lib/go/src/runtime/os_linux.go:44 +0x46
runtime.notesleep(0xec6450)
        /usr/lib/go/src/runtime/lock_futex.go:159 +0x9f
runtime.mPark()
        /usr/lib/go/src/runtime/proc.go:1340 +0x39
runtime.stopm()
        /usr/lib/go/src/runtime/proc.go:2301 +0x92
runtime.findrunnable(0xc000033000, 0x0)
        /usr/lib/go/src/runtime/proc.go:2960 +0x72e
runtime.schedule()
        /usr/lib/go/src/runtime/proc.go:3169 +0x2d7
runtime.goexit0(0xc004011980)
        /usr/lib/go/src/runtime/proc.go:3478 +0x1de
runtime.mcall(0x7f67cd18cce7)
        /usr/lib/go/src/runtime/asm_amd64.s:327 +0x5b

goroutine 1 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc00035c108)
        /usr/lib/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc00035c100)
        /usr/lib/go/src/sync/waitgroup.go:130 +0x65
github.com/gocolly/colly/v2.(*Collector).Wait(...)
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/[email protected]/colly.go:812
plHttpCrawler/crawler.RunJob(0xc00035ced0, 0xc, 0x1f48, 0x58, 0xfa, 0x2710, 0xc00035cf18, 0x4, 0x1, 0x61280f41, ...)
        /home/conductor/plHttpCrawler/crawler/crawl.go:201 +0x56a
plHttpCrawler/crawler.Crawl(0xc00035ced0, 0xc, 0x1f48, 0x58, 0xfa, 0x2710, 0xc00035cf18, 0x4, 0x1, 0x61280f41, ...)
        /home/conductor/plHttpCrawler/crawler/crawl.go:37 +0x5d8
main.main()
        /home/conductor/plHttpCrawler/.BUILD_SOURCE/plHttpCrawler.go:40 +0x878

goroutine 849 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f67a647a5a0, 0x72, 0xffffffffffffffff)
        /usr/lib/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc0035e2698, 0x72, 0x200, 0x200, 0xffffffffffffffff)
        /usr/lib/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0035e2680, 0xc00451c000, 0x200, 0x200, 0x0, 0x0, 0x0)
        /usr/lib/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc0035e2680, 0xc00451c000, 0x200, 0x200, 0x7f67a6149ef0, 0xc000918e00, 0x0)
        /usr/lib/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc00021e910, 0xc00451c000, 0x200, 0x200, 0x0, 0x0, 0x0)
        /usr/lib/go/src/net/net.go:183 +0x91
net/http.(*readWriteCloserBody).Read(0xc000535890, 0xc00451c000, 0x200, 0x200, 0xc00451a035, 0xc00451c000, 0x7f67a6149ef0)
        /usr/lib/go/src/net/http/transport.go:2371 +0x102
net/http.(*cancelTimerBody).Read(0xc001384580, 0xc00451c000, 0x200, 0x200, 0x972700, 0x9c8101, 0xc00451c000)
        /usr/lib/go/src/net/http/client.go:939 +0x62
io.(*LimitedReader).Read(0xc0005358a8, 0xc00451c000, 0x200, 0x200, 0xc0005358a8, 0x0, 0xc000040400)
        /usr/lib/go/src/io/io.go:473 +0x63
io.ReadAll(0xae8480, 0xc0005358a8, 0x0, 0x0, 0xffffffffffffffff, 0xc000e7b3e0, 0x0)
        /usr/lib/go/src/io/io.go:633 +0xdf
io/ioutil.ReadAll(...)
        /usr/lib/go/src/io/ioutil/ioutil.go:27
github.com/gocolly/colly/v2.(*httpBackend).Do(0xc000071440, 0xc000918e00, 0x2540be400, 0xc0013fff38, 0x0, 0x0, 0x0)
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/[email protected]/http_backend.go:209 +0x349
github.com/gocolly/colly/v2.(*httpBackend).Cache(0xc000071440, 0xc000918e00, 0x2540be400, 0xc001138f38, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/[email protected]/http_backend.go:134 +0x77
github.com/gocolly/colly/v2.(*Collector).fetch(0xc0000f6700, 0xc0000292e0, 0x19, 0xa40e5d, 0x3, 0x1, 0x0, 0x0, 0xc000ad2a50, 0xc000924210, ...)
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/[email protected]/colly.go:653 +0x2b3
created by github.com/gocolly/colly/v2.(*Collector).scrape
        /home/conductor/go/pkg/mod/github.com/gocolly/colly/[email protected]/colly.go:574 +0x3e5

rax    0xca
rbx    0xec6300
rcx    0x46d303
rdx    0x0
rdi    0xec6450
rsi    0x80
rbp    0x7ffd8e7e4230
rsp    0x7ffd8e7e41e8
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x0
r13    0x1c65
r14    0xadaf50
r15    0x0
rip    0x46d301
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

JustinTimperio avatar Aug 26 '21 21:08 JustinTimperio

I've created the following repository to show the bug in action: https://github.com/Project-Lernaean/CollyTimeoutError

To reproduce use go run main.go and it will iterate through a list of 100 IP port combos. It will always endlessly hang on at least one IP.

JustinTimperio avatar Aug 26 '21 23:08 JustinTimperio

My best guess is this is related to https://github.com/golang/go/issues/27752#issuecomment-492677602 The Dialer.Deadline should be setting the connection deadline, but sometimes doesn't work. If you reduce the number of urls sometimes they all work, sometimes they do not. Maybe reducing the testcase to remove Colly and demonstrate the problem with http.Transport will help get to the underlying issue.

dherbst avatar Aug 30 '21 12:08 dherbst

@dherbst Thanks for the advice. I had a feeling this was lower level than colly but I was hoping that it wasn't an issue with the HTTP package. I can modify my bug demonstration to not use colly and submit it to the Golang repo but I'll do some more testing before then.

JustinTimperio avatar Aug 30 '21 14:08 JustinTimperio

@JustinTimperio I rewrote your test without colly and I can reproduce the hanging issue with a subset of your url list here: https://github.com/dherbst/deadlinetest

dherbst avatar Aug 31 '21 11:08 dherbst

Awesome work @dherbst! I can sumbit this to the golang source repo with a full explanation of what's going on with the bug later this afternoon

JustinTimperio avatar Aug 31 '21 18:08 JustinTimperio

Then c.SetRequestTimeout() should probably do what this comment says right? I assume it doesn't currently, as then @JustinTimperio wouldn't have had the issue.

net.Dialer.Timeout and net.Dialer.Deadline are indeed for the controlling the deadline of the dial operation only. Setting a request deadline with http.NewRequestWithContext should work in all circumstances, including when reads or writes to the underlying connection are slow. If that doesn't work, we have a bug, but that does not seem to be the case here.

k4lizen avatar Aug 09 '23 03:08 k4lizen

But Collector.SetRequestTimeout sets net/http.Client's Timeout, which looks correct.

And thus repro in github.com/dherbst/deadlinetest is wrong, as it sets the dialer's timeout.

WGH- avatar Aug 09 '23 14:08 WGH-

I see that now, thanks. Fixed the test.

dherbst avatar Aug 09 '23 15:08 dherbst

For me https://github.com/dherbst/deadlinetest doesn't hang. Also https://github.com/Project-Lernaean/CollyTimeoutError doesn't hang (even when using urls.txt.bak). Also, https://github.com/Project-Lernaean/CollyTimeoutError doesn't actually use SetRequestTimeout anywhere in the code.

k4lizen avatar Oct 14 '23 11:10 k4lizen

For me https://github.com/dherbst/deadlinetest doesn't hang.

It will not hang any more, I fixed the test.

It was originally hanging because I based it off the CollyTimeoutError repo, which omitted the timeout. I haven't looked at this code in a couple of years, but if I remember correctly the issue originally reported is fixed by setting the request timeout as I have in my test. This issue should probably be closed.

dherbst avatar Oct 14 '23 12:10 dherbst

And yet the original report had c.SetRequestTimeout, which has always set net/http.Client's Timeout.

Something isn't right, but I'm inclining to close this until some concrete reproduction arrives.

WGH- avatar Oct 16 '23 22:10 WGH-