
mercury times out trying to access www.nasdaq.com (likely user-agent-ish related)

Open thoraxe opened this issue 4 years ago • 0 comments

  • Platform: Linux localhost.localdomain 5.9.16-200.fc33.x86_64 #1 SMP Mon Dec 21 14:08:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Mercury Parser Version: 2.2.0
  • Node Version (if a Node bug): v14.15.1

~/node_modules/.bin/mercury-parser https://www.nasdaq.com/articles/voyager-digital-reports-75-revenue-rise-in-q4-cites-increased-crypto-adoption-2021-01-05

Expected Behavior

Mercury Parser should fetch the page and extract its content.

Current Behavior

Mercury Parser encountered a problem trying to parse that resource.

Error: ESOCKETTIMEDOUT
    at ClientRequest.<anonymous> (/home/thoraxe/node_modules/postman-request/request.js:1094:19)
    at Object.onceWrapper (events.js:421:28)
    at ClientRequest.emit (events.js:315:20)
    at TLSSocket.emitRequestTimeout (_http_client.js:784:9)
    at Object.onceWrapper (events.js:421:28)
    at TLSSocket.emit (events.js:327:22)
    at TLSSocket.Socket._onTimeout (net.js:483:8)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7) {
  code: 'ESOCKETTIMEDOUT',
  connect: false
}

Steps to Reproduce

  1. ~/node_modules/.bin/mercury-parser https://www.nasdaq.com/articles/voyager-digital-reports-75-revenue-rise-in-q4-cites-increased-crypto-adoption-2021-01-05

Detailed Description

It appears that nasdaq may somehow be identifying us as a "non-browser" client and refusing to serve the page. curl gets an "Access denied" response, and Lynx never loads the page. The link definitely works.

Possible Solution

I'm not sure how the nasdaq server identifies that we're not a real browser, but the request definitely does not go through. I also tried spoofing a user agent:

~/node_modules/.bin/mercury-parser --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1" https://www.nasdaq.com/articles/voyager-digital-reports-75-revenue-rise-in-q4-cites-increased-crypto-adoption-2021-01-05

I got the same timeout.
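
To narrow down whether the server is denying the request outright or silently stalling it, a small probe outside Mercury Parser may help. The sketch below is a hypothetical diagnostic, not part of Mercury Parser: `browserLikeHeaders` and `probe` are made-up names, and the header set is only a guess at what a browser sends, since it is unknown which headers (if any) the server inspects. It uses Node's built-in `https` module only.

```javascript
// Hypothetical diagnostic helpers; not part of Mercury Parser.
const https = require("https");

// Assumed browser-like headers. Which of these the server checks is unknown.
function browserLikeHeaders(userAgent) {
  return {
    "User-Agent": userAgent,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
  };
}

// Resolves with { outcome: "response", statusCode } when the server answers,
// or { outcome: "timeout" } when the socket stays idle for timeoutMs.
// Rejects on other network errors.
function probe(url, headers, timeoutMs = 10000) {
  return new Promise((resolve, reject) => {
    const req = https.get(url, { headers, timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body; only the status code matters here
      resolve({ outcome: "response", statusCode: res.statusCode });
    });
    req.on("timeout", () => {
      req.destroy(); // the "timeout" event alone does not abort the request
      resolve({ outcome: "timeout" });
    });
    req.on("error", reject); // no effect if the promise has already settled
  });
}
```

Calling `probe(url, browserLikeHeaders("<some user agent>")).then(console.log)` with the article URL and a few different user agents would show whether any combination gets a response. A 4xx status would point to an explicit denial, while a timeout with every header set would suggest the server decides on something other than these headers.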

thoraxe • Jan 07 '21 02:01