
Selectors return corrupted, recursive DOM on some sites

Open shervinmathieu opened this issue 4 years ago • 3 comments

Description

On a specific site, Scrapy selectors (CSS and XPath) corrupt the DOM recursively and, as a result, return an incorrect number of items. I encountered this issue while parsing base-search.net search results, but this bug might occur on other sites as well.

Steps to Reproduce

Example for base-search.net

  1. Begin parsing a base-search.net search results page, e.g.: scrapy shell "https://www.base-search.net/Search/Results?lookfor=graph+visualisation"
  2. Note the number of div elements with the class .record-panel: response.css(".record-panel") returns 10 items
  3. Now select an element inside these divs, for example response.css(".link-gruen"): the output is also 10 items
  4. Now chain the two selectors: response.css(".record-panel").css(".link-gruen") returns 55(!) items, even though we have just established there are only 10 .link-gruen elements in the DOM
  5. Note that response.css(".record-panel .record-panel") returns a non-zero number of items, even though no such nested element exists in the original DOM
  6. Chain selectors on this non-existent element, and the number of .link-gruen items returned grows recursively: response.css(".record-panel").css(".record-panel").css(".link-gruen") returns 220 items, and response.css(".record-panel").css(".record-panel").css(".record-panel").css(".link-gruen") returns 715 items

Expected behavior: Only ten items should be returned in this example.

Actual behavior: Each selected element's DOM contains its own .record-panel but also all following .record-panel divs, nested recursively. Chaining selectors on this corrupted DOM corrupts it further, increasing the number of items returned without bound.
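The reported counts are consistent with that nesting: if each panel's corrupted subtree contains itself plus all following panels, the per-panel .link-gruen counts are 10, 9, …, 1 (see the loop output further down in this thread), and each extra chained .record-panel selector sums those tails again. A quick sanity check of the arithmetic, modeling only the tail-sums rather than using parsel:

```python
from math import comb

# .link-gruen count inside each corrupted panel subtree: panel i
# "contains" itself plus all following panels, so counts are 10..1.
links = list(range(10, 0, -1))

def chain_once(per_panel):
    # Re-selecting .record-panel inside each subtree repeats the
    # tail-sum structure (descendant-or-self matching).
    return [sum(per_panel[i:]) for i in range(len(per_panel))]

totals = []
current = links
for _ in range(3):
    totals.append(sum(current))
    current = chain_once(current)

print(totals)  # [55, 220, 715] -- matching the reported counts
# Equivalently, n chained .record-panel selectors yield comb(10 + n, n + 1).
assert totals == [comb(10 + n, n + 1) for n in range(1, 4)]
```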

Reproduces how often: Always

Versions

Scrapy       : 1.8.0
lxml         : 4.5.0.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.10.0
Python       : 3.7.5 (default, Nov 7 2019, 10:50:52) - [GCC 8.3.0]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.8
Platform     : Linux-4.15.0-76-generic-x86_64-with-Ubuntu-18.04-bionic

Additional context

The issue happens with both CSS and XPath selectors; equivalent XPath selectors lead to the same result. Opening view(response) shows that the DOM Scrapy receives for parsing does not contain any recursively nested items: selecting .record-panel .record-panel in the browser (on the local file, not the live page) yields no results. In Scrapy, however, response.css(".record-panel .record-panel") returns 9 items, response.css(".record-panel .record-panel .record-panel") returns 8 items, and so on.

shervinmathieu avatar Feb 07 '20 18:02 shervinmathieu

Transferred the issue here since it doesn't seem to be a problem with Scrapy specifically but rather with Parsel, the underlying selector library:

In [1]: from parsel import Selector, __version__

In [2]: __version__
Out[2]: '1.5.2'

In [3]: import requests

In [4]: sel = Selector(text=requests.get("https://www.base-search.net/Search/Results?lookfor=graph+visualisation").text)

In [5]: len(sel.css(".record-panel"))
Out[5]: 10

In [6]: len(sel.css(".link-gruen"))
Out[6]: 10

In [7]: len(sel.css(".record-panel").css(".link-gruen"))
Out[7]: 55

elacuesta avatar Feb 07 '20 18:02 elacuesta

:eyes:

>>> for panel in sel.css('.record-panel'):
...     print(len(panel.css('.link-gruen')))
... 
10
9
8
7
6
5
4
3
2
1

Gallaecio avatar May 22 '20 16:05 Gallaecio

There is a bug in the source HTML which browsers manage to fix but lxml does not: the page closes HTML comments with --!> instead of -->.
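The effect is easy to see without parsel: Python's own html.parser likewise only recognizes --> as a comment terminator, so everything between a --!>-"closed" comment and the next real --> gets swallowed into the comment, including any intervening elements. A minimal sketch (the markup is made up for illustration):

```python
from html.parser import HTMLParser

class Collector(HTMLParser):
    """Records comment bodies and counts div start tags."""
    def __init__(self):
        super().__init__()
        self.comments = []
        self.divs = 0

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.divs += 1

# The first comment is "closed" with --!>; a spec-following browser
# ends the comment there and would see two divs.
html = '<!-- one --!><div>a</div><!-- two --><div>b</div>'

parser = Collector()
parser.feed(html)
parser.close()

print(parser.divs)      # 1 -- the first div was swallowed by the comment
print(parser.comments)  # the comment runs all the way to the real -->
```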

Workaround: .replace('--!>', '-->')

>>> text = requests.get("https://www.base-search.net/Search/Results?lookfor=graph+visualisation").text
>>> text = text.replace('--!>', '-->')
>>> sel = Selector(text=text)
>>> len(sel.css(".record-panel").css(".link-gruen"))
10
>>> for panel in sel.css('.record-panel'):
...     print(len(panel.css('.link-gruen')))
... 
1
1
1
1
1
1
1
1
1
1
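For anyone hitting this before a parser-level fix lands, the workaround can be wrapped in a small pre-processing helper. A sketch: the function name is mine, and the plain str.replace only covers the exact --!> spelling.

```python
def fix_misclosed_comments(html: str) -> str:
    """Normalize the non-standard HTML comment terminator --!> to -->,
    so that lxml-based parsers close comments where browsers do."""
    return html.replace('--!>', '-->')

# Usage with parsel, following the session above:
#   sel = Selector(text=fix_misclosed_comments(response_text))
broken = '<!-- note --!><div class="record-panel"></div>'
print(fix_misclosed_comments(broken))
```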

I suggest we leave this open as a feature request.

Hopefully #83 will allow fixing this, but this issue should remain open: if a new parser introduced as part of #83 does not fix it, we should look for alternative parsers that handle this markup, or get support for it upstream in one of the supported parsers.

Gallaecio avatar May 22 '20 16:05 Gallaecio