
Latest BLC does not finish properly

Open dbogatov opened this issue 8 years ago • 10 comments

Sometimes (I would say most of the time), the latest BLC (v0.7.6) silently fails in the middle of its work. As a consequence, it does not report the result, and the exit code is of no use.

The last known version that does not have this bug is v0.7.3.

See output

$ docker run -it node:8.9.1-alpine /bin/sh
/ # npm install -g broken-link-checker
npm WARN deprecated [email protected]: try optionator
/usr/local/bin/blc -> /usr/local/lib/node_modules/broken-link-checker/bin/blc
/usr/local/bin/broken-link-checker -> /usr/local/lib/node_modules/broken-link-checker/bin/blc
+ broken-link-checker@0.7.6
added 100 packages in 3.561s
/ # blc https://google.com
Getting links from: https://google.com/
├───OK─── https://www.google.com/imghp?hl=en&tab=wi
├───OK─── https://maps.google.com/maps?hl=en&tab=wl
├───OK─── https://play.google.com/?hl=en&tab=w8
├───OK─── https://news.google.com/nwshp?hl=en&tab=wn
├───OK─── https://mail.google.com/mail/?tab=wm
├───OK─── https://www.youtube.com/?gl=US&tab=w1
├───OK─── https://drive.google.com/?tab=wo
├───OK─── https://www.google.com/intl/en/options/
├───OK─── http://www.google.com/history/optout?hl=en
├───OK─── https://www.google.com/preferences?hl=en
├───OK─── https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/
├───OK─── https://www.google.com/search?site=&ie=UTF-8&q=Chinua+Achebe&oi=ddle&ct=chinua-achebes-87th-birthday-5104396332433408&hl=en&sa=X&ved=0ahUKEwjC4dzIo8TXAhUB4iYKHYtvB7UQPQgD
├───OK─── https://www.google.com/logos/doodles/2017/chinua-achebes-87th-birthday-5104396332433408.3-l.png
├───OK─── https://www.google.com/advanced_search?hl=en&authuser=0
├───OK─── https://www.google.com/language_tools?hl=en&authuser=0
├───OK─── https://www.google.com/intl/en/ads/
├───OK─── https://www.google.com/services/
├───OK─── https://plus.google.com/116899029375914044550
├───OK─── https://www.google.com/intl/en/about.html
├───OK─── https://www.google.com/intl/en/policies/privacy/
/ #

dbogatov avatar Nov 16 '17 23:11 dbogatov

I've just had the same thing happen to me. It took a lot longer on my site crawl (it has logged about 15 MB of output to file), but now it is just sitting there spinning and not getting anywhere.

Did anyone find a workaround to this?

pointandyshoot avatar Feb 27 '18 20:02 pointandyshoot

Node version? Also, try the v0.8.0 branch

stevenvachon avatar Feb 27 '18 22:02 stevenvachon

Node version v6.11.4, BLC 0.7.7

pointandyshoot avatar Feb 27 '18 23:02 pointandyshoot

Not sure if relevant, but we were seeing a hang in our own broken link checker that uses this library. My colleague implemented a small workaround in our code that seems to be helping: https://github.com/code-dot-org/code-dot-org/pull/21310

breville avatar Mar 26 '18 20:03 breville

Update: we've continued getting zombie processes that didn't exit, after all.

breville avatar Jun 12 '18 19:06 breville

I've been getting something similar: BLC will just spin. For what it's worth, I traced it down to the lookup of https://www.sothebys.com/en/ and didn't notice anything crazy on that page or in its headers, so I have no clue past that.

pingevt avatar Sep 15 '20 13:09 pingevt

For what it's worth, I've struck this problem too, where it seems BLC simply hangs near the end of processing the links.

It had been working fine for me for a long time (version 0.7.6), but it suddenly started hanging and never completing -- I suspect there's a particular link somewhere that's not getting processed correctly (an unresolved promise?), though I notice that when I process different quantities of links (e.g. 10, 100, 400), it processes pretty much all of them before hanging.

To work around this, I've used a setTimeout in the link API callback: if the timeout is not cleared within 30 seconds, it calls my finish routine, which would normally be called by the end API callback:

const blc = require('broken-link-checker')

let timeout
let linkCount = 0 // must start at 0, otherwise linkCount++ yields NaN
// ...

const htmlUrlChecker = new blc.HtmlUrlChecker({
  excludeInternalLinks: true,
  cacheResponses: false,
  excludeLinksToSamePage: true,
}, {
  link: function (result) {
    linkCount++
    log(`Processed ${linkCount}: ${result.url.original}`)

    // Restart the 30-second watchdog every time a link is processed.
    clearTimeout(timeout)
    timeout = setTimeout(() => {
      // broken-link-checker may not finish -- refer:
      // * https://github.com/stevenvachon/broken-link-checker/issues/90
      // It does however seem to always get stuck almost at the end.
      // After waiting 30 seconds for the next link to be processed,
      // we'll exit.
      finish()
    }, 30000) // 30 seconds
  },
  end: function () {
    // Normal completion: cancel the watchdog so finish() is not called twice.
    clearTimeout(timeout)
    finish()
  },
})
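
The same idea can be packaged as a small reusable helper. What follows is a minimal sketch, not code from this thread or from broken-link-checker: `createWatchdog` and `once` are hypothetical names, and the 30-second default simply mirrors the value used above. The watchdog fires a callback if no progress is reported for a while, and `once` guards against the finish routine running twice when both the timeout and the `end` handler fire.

```javascript
// Hypothetical helpers; neither is part of broken-link-checker.

// Calls onStall() if kick() has not been called again within idleMs.
function createWatchdog(onStall, idleMs = 30000) {
  let timer = null
  let fired = false

  return {
    // Report progress, e.g. from the `link` handler.
    kick() {
      if (fired) return // already gave up; ignore late progress
      clearTimeout(timer)
      timer = setTimeout(() => {
        fired = true
        onStall()
      }, idleMs)
    },
    // Cancel the watchdog, e.g. from the `end` handler.
    stop() {
      clearTimeout(timer)
      timer = null
    },
  }
}

// Wraps fn so that only the first call runs it; later calls return undefined.
function once(fn) {
  let called = false
  return (...args) => {
    if (called) return undefined
    called = true
    return fn(...args)
  }
}

// Possible wiring (assumes a user-defined finish() routine):
//   const done = once(finish)
//   const watchdog = createWatchdog(done)
//   link: function (result) { watchdog.kick() },
//   end:  function () { watchdog.stop(); done() },
```

Given the reports of zombie processes earlier in this thread, the finish routine may also need to end the process explicitly (for example with `process.exit()`) if something is still holding the event loop open; that is an assumption, not something confirmed here.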

jcdarwin avatar Apr 14 '21 07:04 jcdarwin

I'm also having the same issue. @jcdarwin's suggestion is exactly what I was thinking of doing, so I'm glad to know that I'm not the only one dealing with this.

However, it would be good to find out why the process hangs close to the end. Right now we're unable to check roughly 40 links in a database of more than 1000 links.

Jeandcc avatar May 10 '21 22:05 Jeandcc

Still seeing this today…

aarongustafson avatar Mar 31 '23 16:03 aarongustafson

> In order to work-around this, I've used a setTimeout in the link API callback, such that if the setTimeout is not cleared within 30 seconds, it calls my finish routine that would normally be called by the end API callback:

@jcdarwin What does your finish() routine look like? I tried to figure out how to un-stall it by looking at the project source, but didn’t see a clear way to access done() on the item.

aarongustafson avatar Mar 31 '23 17:03 aarongustafson