safe_url_string handling IPv6 URLs

Open Cash111 opened this issue 3 years ago • 5 comments

Description

Demo spider with settings:

DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver"
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo_spider'
    start_urls = ['https://[2402:4e00:40:40::2:3b6]']

    def parse(self, response, **kwargs):
        print(response.body)
        print(response)

Command to start the spider:

scrapy crawl demo_spider -s JOBDIR=./jobs/run-1

When I use the JOBDIR parameter, it causes an exception:

Traceback (most recent call last):
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 27, in run
    self.crawler_process.start()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/crawler.py", line 348, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/reactor.py", line 51, in __call__
    return self._func(*self._a, **self._kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 147, in _next_request
    while not self._needs_backout() and self._next_request_from_scheduler() is not None:
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 176, in _next_request_from_scheduler
    request = self.slot.scheduler.next_request()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 263, in next_request
    request = self._dqpop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 299, in _dqpop
    return self.dqs.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/pqueues.py", line 99, in pop
    m = q.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/squeues.py", line 78, in pop
    return request_from_dict(request, spider=self.spider)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/request.py", line 124, in request_from_dict
    return request_cls(**kwargs)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 60, in __init__
    self._set_url(url)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 100, in _set_url
    s = safe_url_string(url, self.encoding)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/w3lib/url.py", line 103, in safe_url_string
    parts.port,
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 178, in port
    raise ValueError(message) from None
builtins.ValueError: Port could not be cast to integer value as '4e00:40:40::2:3b6'

I debugged and found that the problem was in urllib.parse#L202, as shown below:

[Screenshot: debugger stopped in urllib.parse's _hostinfo, where self is a SplitResult with scheme='https' and a netloc containing the IPv6 address without brackets.]

When I stopped using the JOBDIR parameter and debugged again, I found that the problem still existed. At this point it shows up in middlewares such as CookieJar, RetryMiddleware, RobotsTxtMiddleware, and so on.

[Screenshot: debugger in the cookie handling code, at hosts = potential_domain_matches(req_host), with the host values derived from the bracket-less IPv6 address.]

The problem appears to be in the creation of the Request instance: it calls self._set_url, which turns the URL https://[2402:4e00:40:40::2:3b6] into https://2402:4e00:40:40::2:3b6.

When a middleware then creates another Request instance from Request.url, the call to self._set_url returns the wrong hostname and port.
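
For reference, the round trip can be reproduced without Scrapy at all. Below is a minimal sketch, assuming the w3lib 2.0.1 behaviour shown above, using only safe_url_string and the standard library's urlsplit:

from urllib.parse import urlsplit

from w3lib.url import safe_url_string

original = 'https://[2402:4e00:40:40::2:3b6]'

# First pass (Request.__init__ -> _set_url): on the affected w3lib version
# the brackets around the IPv6 literal are dropped.
broken = safe_url_string(original)
print(broken)  # 'https://2402:4e00:40:40::2:3b6'

# Second pass (the JOBDIR queue or a middleware rebuilding a Request from
# that URL): without brackets, urlsplit treats everything after the first
# ':' in the netloc as a port.
parts = urlsplit(broken)
print(parts.hostname)  # '2402'
try:
    parts.port
except ValueError as exc:
    print(exc)  # Port could not be cast to integer value as '4e00:40:40::2:3b6'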

Versions

$ scrapy version --verbose
Scrapy       : 2.6.3
lxml         : 4.9.1.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 2.0.1
Twisted      : 22.8.0
Python       : 3.9.6 (default, Sep 13 2022, 22:03:16) - [Clang 14.0.0 (clang-1400.0.29.102)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-12.6-arm64-arm-64bit

Cash111 (Oct 09 '22)

Temporarily solved this problem by downgrading w3lib to 1.22.0

Cash111 (Oct 10 '22)

In [5]: safe_url_string('https://[2402:4e00:40:40::2:3b6]')
Out[5]: 'https://2402:4e00:40:40::2:3b6'

In [6]: safe_url_string('https://[2402:4e00:40:40::2:3b6]:80')
Out[6]: 'https://2402:4e00:40:40::2:3b6:80'

This indeed looks like a bug.

wRAR (Oct 10 '22)

urlsplit returns '[2402:4e00:40:40::2:3b6]:80' in netloc but '2402:4e00:40:40::2:3b6' in hostname, and safe_url_string uses this hostname value directly without putting it back in brackets. There may be some code in urllib that should be used here instead.
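
For illustration, a rough sketch of the kind of fix this points at: put the brackets back when the hostname is an IPv6 literal before rebuilding the netloc. rebuild_netloc below is a hypothetical helper, not part of w3lib's API:

from urllib.parse import urlsplit

def rebuild_netloc(parts):
    # Hypothetical helper: rebuild the netloc from a SplitResult, restoring
    # the brackets that urlsplit strips from IPv6 hostnames.
    host = parts.hostname or ''
    if ':' in host:  # a bare IPv6 literal needs its brackets back
        host = f'[{host}]'
    if parts.port is not None:
        host = f'{host}:{parts.port}'
    # A real fix would also need to re-attach any userinfo (user:password@).
    return host

parts = urlsplit('https://[2402:4e00:40:40::2:3b6]:80')
print(parts.netloc)           # '[2402:4e00:40:40::2:3b6]:80'
print(parts.hostname)         # '2402:4e00:40:40::2:3b6'
print(rebuild_netloc(parts))  # '[2402:4e00:40:40::2:3b6]:80'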

wRAR (Oct 10 '22)

Hi, I would like to work on this issue.

himanshu007-creator (Oct 17 '22)

@himanshu007-creator sure, no problem with that

wRAR (Oct 17 '22)