
Content.Text empty despite response code OK and Content stream contains data

Open seanarmstrong87 opened this issue 1 year ago • 0 comments

I am trying to crawl this page:

https://www.tzb-info.cz/kontakty

I pass it as validUri to the following code:

        var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());

        var crawledPage = await pageRequester.MakeRequestAsync(validUri).ConfigureAwait(false);
            
        Log.Logger.Information("{@Result}", new
        {
            url = crawledPage.Uri,
            status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
        });

        return crawledPage.Content.Text;

That website sets a less common charset in its header, like this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">

As a result, Content.Text is always empty even though the response code indicates success.

If I try to read the response stream directly I get this exception:

The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.

If I change the CharSet on the response manually, I am then able to read the stream:

args.CrawledPage.HttpResponseMessage.Content.Headers.ContentType.CharSet = @"ISO-8859-1";

This is my workaround for now.

Is this a bug where the "iso-8859-2" charset is not being interpreted correctly? Or am I missing something in the configuration or setup that is needed to handle this charset?
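
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the application targets .NET Core or .NET 5+, only a handful of encodings (such as UTF-8 and ISO-8859-1) are available by default, and ISO-8859-2 has to be enabled by registering CodePagesEncodingProvider. The sketch below is independent of Abot; the class and the DecodeLatin2 helper are hypothetical names used only for illustration. It also shows why forcing ISO-8859-1 is lossy for Czech text: the same byte decodes to a different character.

```csharp
using System;
using System.Text;

public static class Latin2Example
{
    // Hypothetical helper, not part of Abot: decodes raw bytes as ISO-8859-2.
    public static string DecodeLatin2(byte[] bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where ISO-8859-2 is only
        // available once the code pages provider has been registered.
        // In a real application this is typically done once at startup.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("iso-8859-2").GetString(bytes);
    }

    public static void Main()
    {
        // The byte 0xE8 is "č" in ISO-8859-2 but "è" in ISO-8859-1.
        byte[] sample = { 0xE8 };
        Console.WriteLine(DecodeLatin2(sample));
        Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(sample));
    }
}
```

If this assumption holds, registering the provider before making the request might let the original charset be read without overriding it, but I have not verified that against Abot itself.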

seanarmstrong87 · Nov 28 '22 15:11