abot
Content.Text empty despite response code OK and Content stream contains data
I am trying to crawl this page:
https://www.tzb-info.cz/kontakty
I pass it as validUri to the following code:
var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());
var crawledPage = await pageRequester.MakeRequestAsync(validUri).ConfigureAwait(false);
Log.Logger.Information("{@Result}", new
{
    url = crawledPage.Uri,
    status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
});
return crawledPage.Content.Text;
That website has a less common charset set in its header, like this:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
As a result, Content.Text is always empty even though the response code indicates success.
If I try to read the response stream directly, I get this exception:
The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.
If I change the CharSet on the response manually, I am then able to read the stream:
args.CrawledPage.HttpResponseMessage.Content.Headers.ContentType.CharSet = @"ISO-8859-1";
This is my workaround for now.
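One caveat with this workaround: decoding ISO-8859-2 bytes as ISO-8859-1 will likely garble Czech characters such as "č". A possible alternative, which I have not confirmed against Abot itself, assumes the application runs on .NET Core or .NET 5+, where legacy code pages such as iso-8859-2 are only resolvable after registering CodePagesEncodingProvider. The sketch below shows that registration; DecodeWithCharset is a hypothetical helper for illustration, not part of Abot.

```csharp
using System;
using System.Text;

// Assumption: running on .NET Core / .NET 5+, where iso-8859-2 is not
// available until the code pages provider is registered. On older
// .NET Core versions this requires the System.Text.Encoding.CodePages package.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper (not an Abot API): decode raw response bytes
// with the charset name that the page declares.
static string DecodeWithCharset(byte[] bytes, string charset) =>
    Encoding.GetEncoding(charset).GetString(bytes);

// Byte 0xE8 is "č" in ISO-8859-2 but "è" in ISO-8859-1.
byte[] sample = { 0xE8 };
Console.WriteLine(DecodeWithCharset(sample, "iso-8859-2"));
Console.WriteLine(DecodeWithCharset(sample, "iso-8859-1"));
```

If the missing provider is the cause, calling Encoding.RegisterProvider once at application startup, before any request is made, might allow the original charset to be read without overriding the header.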
Is it a bug that the "iso-8859-2" charset is not being interpreted correctly? Or am I missing something in the configuration or setup that is needed to handle this charset?