Decoding error

Open · pgte opened this issue 2 years ago · 0 comments

Hi, I'm getting an error when running an indexing job:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb6 in position 8: ordinal not in range(128)

Do you have any hints?

Here is the full log:

(...)
Successfully installed certifi-2022.5.18.1 distlib-0.3.4 filelock-3.4.1 importlib-metadata-4.8.3 importlib-resources-5.4.0 pipenv-2022.4.8 platformdirs-2.4.0 six-1.16.0 typing-extensions-4.1.1 virtualenv-20.14.1 virtualenv-clone-0.5.7 zipp-3.6.0
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Installing dependencies from Pipfile.lock (aabb41)...
2022-05-31 16:57:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://dev.decipad.com/docs/language/numbers/> (referer: https://dev.decipad.com/docs/sitemap.xml)
Traceback (most recent call last):
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/abstract_strategy.py", line 40, in get_dom
    body = response.body.decode(response.encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9c in position 4: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/documentation_spider.py", line 169, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/documentation_spider.py", line 148, in add_records
    records = self.strategy.get_records_from_response(response)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/default_strategy.py", line 39, in get_records_from_response
    self.dom = self.get_dom(response)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/abstract_strategy.py", line 43, in get_dom
    result = lxml.html.fromstring(response.body)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 764, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty
2022-05-31 16:57:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://dev.decipad.com/docs/language/> (referer: https://dev.decipad.com/docs/sitemap.xml)
Traceback (most recent call last):
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/abstract_strategy.py", line 40, in get_dom
    body = response.body.decode(response.encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb6 in position 8: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/documentation_spider.py", line 169, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/documentation_spider.py", line 148, in add_records
    records = self.strategy.get_records_from_response(response)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/default_strategy.py", line 39, in get_records_from_response
    self.dom = self.get_dom(response)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/abstract_strategy.py", line 43, in get_dom
    result = lxml.html.fromstring(response.body)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 764, in document_fromstring
    "Document is empty")
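
For reference, the first exception in each traceback is raised by `response.body.decode(response.encoding)`, where the encoding resolves to `ascii` while the body contains bytes above 0x7F. The snippet below is only a minimal sketch of that failure mode and of one possible fallback; it is not the scraper's code. The helper name `decode_body`, the sample bytes, and the fallback order (declared encoding, then UTF-8, then lossy UTF-8) are all assumptions made for illustration.

```python
def decode_body(body: bytes, declared_encoding: str = "ascii") -> str:
    """Hypothetical helper: decode bytes without raising.

    Tries the declared encoding first, then UTF-8, and finally a lossy
    UTF-8 decode that replaces undecodable bytes with U+FFFD.
    """
    for encoding in (declared_encoding, "utf-8"):
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Last resort: never raises, but may lose information.
    return body.decode("utf-8", errors="replace")


# Made-up sample containing the non-ASCII character U+00B6, encoded as UTF-8.
sample = "<p>pilcrow \u00b6</p>".encode("utf-8")

try:
    sample.decode("ascii")  # same kind of call as in the traceback
except UnicodeDecodeError as exc:
    print("strict ascii decode failed:", exc)

print(decode_body(sample, "ascii"))
```

A fallback like this only helps if the body really is text in some other encoding. If the bytes are not HTML at all, decoding leniently would hide the error without producing a parseable document, which would be consistent with the later `Document is empty` error from lxml.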

pgte · May 31 '22 17:05