docs-scraper

null byte issue

Open · kijung-iM opened this issue 1 year ago · 2 comments

Description: Null byte characters get inserted into HTML pages generated by Docusaurus when the language is CJK. This problem has also been reported as an issue in Docusaurus itself.

When I scrape such a page with docs-scraper, it ends up scraping nothing at all. Logic to strip the null byte characters is needed.
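
For anyone who wants to confirm the root cause first, a quick look at the raw HTML is enough; a minimal sketch using only the standard library (whether null bytes still show up depends on the Docusaurus version the site is currently built with):

# Sketch: count U+0000 characters in the served HTML of one of the affected pages listed below
from urllib.request import urlopen

html = urlopen("https://docs.whatap.io/java/agent-apdex").read().decode("utf-8", errors="replace")
print(html.count("\u0000"), "null byte(s) in the raw HTML")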

Example site:

Docs-Scraper: https://docs.whatap.io/java/agent-load-amount 0 records)
Docs-Scraper: https://docs.whatap.io/java/agent-dbsql 0 records)
Docs-Scraper: https://docs.whatap.io/java/agent-apdex 0 records)

I worked around the problem by modifying the files as shown below. Please use this as a reference and fix it properly.

documentation_spider.py:162

def parse_from_sitemap(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)
    
    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if (not self.force_sitemap_urls_crawling) and (
            not self.is_rules_compliant(response)):
        print("\033[94m> Ignored from sitemap:\033[0m " + response.url)
    else:
        # self.add_records(response, from_sitemap=True)
        self.add_records(response.replace(body=response_text), from_sitemap=True)
        # We don't return self.parse(response) in order to avoid crawling those web pages

def parse_from_start_url(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)

    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if self.is_rules_compliant(response):
        # self.add_records(response, from_sitemap=False)
        self.add_records(response.replace(body=response_text), from_sitemap=False)
    else:
        print("\033[94m> Ignored: from start url\033[0m " + response.url)

    # return self.parse(response)
    return self.parse(response.replace(body=response_text))

custom_downloader_middleware.py:37

# body = self.driver.page_source.encode('utf-8')
# remove null byte
body = self.driver.page_source.replace('\u0000', '')
body = body.encode('utf-8')  # UTF-8 encoding
url = self.driver.current_url

default_strategy.py:37

if self._body_contains_stop_content(response):
    return []

# remove null byte
cleaned_body = response.text.replace('\u0000', '')

self.dom = self.get_dom(response.replace(body=cleaned_body.encode('utf-8')))
self.dom = self.remove_from_dom(self.dom, self.config.selectors_exclude)

records = self.get_records_from_dom(response.url)
return records
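
For what it's worth, rather than patching each call site, the same cleanup could probably be done once in a Scrapy downloader middleware so every response is already clean before it reaches the spider and the strategy. A rough sketch (class name, module path, and priority are made up, and the ordering relative to the existing custom downloader middleware would still need checking):

from scrapy.http import HtmlResponse

class NullByteStrippingMiddleware:
    """Sketch: strip U+0000 from HTML responses before the spider sees them."""

    def process_response(self, request, response, spider):
        if isinstance(response, HtmlResponse) and '\u0000' in response.text:
            cleaned = response.text.replace('\u0000', '')
            return response.replace(body=cleaned.encode(response.encoding))
        return response

# settings.py (hypothetical entry)
# DOWNLOADER_MIDDLEWARES = {
#     'scraper.null_byte_middleware.NullByteStrippingMiddleware': 543,
# }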

kijung-iM · Jun 17 '24 05:06

Issue in Docusaurus: https://github.com/facebook/docusaurus/issues/9985

tats-u · Oct 01 '24 09:10

Possibly related to https://github.com/scrapy/parsel/issues/123

tats-u · Oct 01 '24 10:10