SmartReader
Support for German language characters
Hello, we are using your library, but we ran into an issue with German sites: it cannot convert special characters such as Ö and Ä.
I even tried writing the output to files, but it looks the same there.
Hello, thanks for your feedback.
The library does not manage encoding of the source directly, it delegates it to the AngleSharp library. This is widely used, so I do not think that they are unable to handle German characters. For example, you can try the web demo on a Der Spiegel article and verify that it actually works correctly.
Therefore either the issue is that the original source does not communicate encoding correctly (which may happen for old websites) or our library may mess up some particular case. We can troubleshoot this, if you can provide an example source that causes the issue.
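To illustrate the first possibility, here is a minimal sketch of what happens when bytes are decoded with a charset other than the one they were written in. It uses only the standard System.Text.Encoding API; the class and method names are hypothetical, and the choice of UTF-8 versus ISO-8859-1 is an assumption made for the demonstration, not a diagnosis of any particular site.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Encodes text as UTF-8, then decodes the bytes as ISO-8859-1,
    // simulating a server that declares the wrong charset.
    public static string Misdecode(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Decodes the same bytes with the matching charset (round trip).
    public static string DecodeCorrectly(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string original = "Rohöllagerbestände";
        // Each umlaut becomes two unrelated characters ("ö" turns into "Ã¶").
        Console.WriteLine(Misdecode(original));
        // The text survives unchanged when the charsets agree.
        Console.WriteLine(DecodeCorrectly(original));
    }
}
```

If the output of a page shows pairs like "Ã¶" where an umlaut is expected, a mismatch of this kind is a likely cause.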
Thanks for the fast reply. Here is the test result for Der Spiegel, so I assume you were right and the site is the problem:
Simple code:
// Requires: using System.IO; using Microsoft.VisualStudio.TestTools.UnitTesting;
[TestMethod]
public void testGermanCharacters()
{
    // Download and parse the Der Spiegel article.
    SmartReader.Reader sr = new SmartReader.Reader("https://www.spiegel.de/wirtschaft/soziales/greenpeace-und-umwelthilfe-verlangen-gesetzliche-vorgaben-zum-energiesparen-a-b9bc1454-c98c-433f-8988-da5a821d6e00");
    SmartReader.Article article = sr.GetArticle();
    // Append the extracted text to a file; the using block closes the writer.
    using (StreamWriter swClifor = new StreamWriter(@"c:\_WA\NewsCrawlerExportFolder\test.txt", true))
    {
        swClifor.WriteLine(article.TextContent);
    }
}
in VS:
in Notepad:
The problematic site is this:
https://www.finanzen.net/nachricht/aktien/us-rohoellagerbestaende-steigen-unerwartet-11594814
I tried using a different library just to check the encoding, and it downloaded the page with the correct encoding:
// Requires: using System.IO; using HtmlAgilityPack;
[TestMethod]
public void testGermanCharacters()
{
    string german = "Die Rohöllagerbestände in den";
    //SmartReader.Reader sr = new SmartReader.Reader("https://www.finanzen.net/nachricht/aktien/us-rohoellagerbestaende-steigen-unerwartet-11594814");
    //SmartReader.Article article = sr.GetArticle();
    // Download the page with HtmlAgilityPack instead of SmartReader.
    HtmlWeb client = new HtmlWeb();
    HtmlDocument doc = client.Load("https://www.finanzen.net/nachricht/aktien/us-rohoellagerbestaende-steigen-unerwartet-11594814");
    // Append the raw document text to a file; the using block closes the writer.
    using (StreamWriter swClifor = new StreamWriter(@"c:\_WA\NewsCrawlerExportFolder\test.txt", true))
    {
        swClifor.WriteLine(doc.Text);
    }
}
I have taken a look at the source you provided. It is true that the source is not displayed correctly. However, the encoding/charset is correct; the issue seems to be that they are using HTML character references to display German characters rather than standard characters.
In other words, if you inspect the HTML source with a browser, you will see this text:
WASHINGTON (Dow Jones)--Die Rohöllagerbestände in den USA haben sich in der Woche zum 29
That is quite unusual; I imagine it is a legacy of the site's age. The approach probably made sense before UTF-8 became widespread. I think we can solve this, and I will work on it this weekend.
Thank you very much! Please let me know next week, I will wait :)
The latest commit should fix the issue. The problem was not the unusual character references; rather, for some reason AngleSharp ignored the encoding provided in the header. I think that is because some misconfigured servers report an incorrect encoding, so AngleSharp uses heuristics to determine the encoding instead. I added a setting ForceHeaderEncoding (it defaults to false, but for your case you should set it to true) to force the use of the provided encoding.
At the moment the latest change is only on master; it has not been released in a package yet.
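For reference, enabling the setting might look like the sketch below. It assumes ForceHeaderEncoding is exposed as a settable property on SmartReader.Reader and that the SmartReader package (a build containing this change) is referenced; the class name is hypothetical, and the URL is the one reported above.

```csharp
using System;

public static class ForceHeaderEncodingExample
{
    public static void Main()
    {
        var reader = new SmartReader.Reader(
            "https://www.finanzen.net/nachricht/aktien/us-rohoellagerbestaende-steigen-unerwartet-11594814")
        {
            // Assumption: property-style setting; it defaults to false.
            // true tells the reader to trust the charset from the response header.
            ForceHeaderEncoding = true
        };

        SmartReader.Article article = reader.GetArticle();

        // Print the extracted text to check the umlauts.
        Console.WriteLine(article.TextContent);
    }
}
```

Leaving the setting at its default keeps the heuristic detection, which is probably the safer choice for servers that report a wrong charset.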
Thank you, please let me know when it is available as a package. You are perfect! :)
@marhyno I published version 0.9.0 that should fix your issue. It took a while because we found other issues to fix.