TidyManaged
Problems with encoding
Hi, I'm having problems with using the wrapper and UTF-8 strings. For example "é" would be replaced by �
I was just wondering if I'm doing anything fundamentally wrong? I've tried setting Input and Output character encoding values as well. Any help would be appreciated. Example code below.
Many thanks
using (TidyManaged.Document doc = TidyManaged.Document.FromString(myInput))
{
doc.OutputXhtml = true;
doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
doc.CleanAndRepair();
myOutput = doc.Save();
}
Hi there - can you provide a sample HTML document? My (pretty rudimentary) testing seems to work fine with the accented character.
Hi, I had a similar problem. The characters ěščřžýáíéúů were replaced by � after running the text through the parser. The text came from a database, where it was stored with windows-1250 encoding. What I ended up with (after half a day of � spam) was this solution.
//Convert str to bytes using its original encoding, then convert those bytes to the
//encoding we want to use for parsing, and wrap the result in a stream so that .NET
//does not meddle with it.
Encoding srcEncoding = Encoding.GetEncoding("windows-1250");
byte[] srcEncodingBytes = srcEncoding.GetBytes(str);
Encoding destEncoding = Encoding.UTF8;
byte[] destEncodingBytes = Encoding.Convert(srcEncoding, destEncoding, srcEncodingBytes);
var strStream = new MemoryStream(destEncodingBytes);
//do the parsing
var doc = TidyManaged.Document.FromStream(strStream);
doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
doc.ShowWarnings = false;
doc.Quiet = true;
doc.OutputXhtml = true;
doc.CleanAndRepair();
str = doc.Save();
There probably is a more elegant solution, but this works for me.
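For anyone wondering where the � comes from in the first place: one plausible mechanism (not confirmed for TidyManaged itself) is that bytes in a legacy single-byte encoding get decoded as UTF-8. The sketch below uses only System.Text and no TidyManaged calls. The class and method names are made up for illustration, and ISO-8859-1 stands in for windows-1250 because it is available on newer .NET runtimes without registering the code pages provider.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of TidyManaged.
public static class EncodingDemo
{
    // Decodes raw bytes as UTF-8. Invalid sequences become U+FFFD by default.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Re-encodes bytes from a known source encoding into UTF-8 bytes.
    public static byte[] ToUtf8(Encoding source, byte[] bytes)
    {
        return Encoding.Convert(source, Encoding.UTF8, bytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] legacyBytes = latin1.GetBytes("\u00E9"); // "é" as the single byte 0xE9

        // 0xE9 on its own is not a valid UTF-8 sequence.
        string broken = DecodeAsUtf8(legacyBytes);

        // Converting to UTF-8 first yields the two bytes 0xC3 0xA9.
        string intact = DecodeAsUtf8(ToUtf8(latin1, legacyBytes));

        Console.WriteLine(broken == "\uFFFD"); // True
        Console.WriteLine(intact == "\u00E9"); // True
    }
}
```

If this is what is happening, it would explain why handing the parser a stream of known UTF-8 bytes, and telling it they are UTF-8, avoids the problem.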
Here's a unit test to demonstrate what I assume is the same problem:
[Test]
public void RoundTripsUtf8File()
{
// ŋ (velar nasal, U+014B) --> &#331;
// β (Greek beta, U+03B2) --> &beta;
using (var input = TempFile.CreateAndGetPathButDontMakeTheFile())
{
var source = "<!DOCTYPE html><html><head> <meta charset='UTF-8'></head><body>ŋ β</body></html>";
File.WriteAllText(input.Path, source, Encoding.UTF8);
using (var tidy = TidyManaged.Document.FromFile(input.Path))
{
tidy.CharacterEncoding = EncodingType.Utf8; //tried Raw, too
tidy.CleanAndRepair();
using (var output = new TempFile())
{
tidy.Save(output.Path);
var newContents = File.ReadAllText(output.Path);
Assert.IsTrue(newContents.Contains("ŋ"), newContents);
}
}
}
}
This outputs:
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 October 2008), see www.w3.org">
<meta charset='UTF-8'>
<title></title>
</head>
<body>
&#331; &beta;
</body>
</html>
Expected: True
But was: False
In contrast, from the command line,
tidy -utf8 test.htm
does what is expected: the two characters emerge in the same form that they went in.
After more experimenting, and trying to figure out why Radek's code works, I found that libtidy is apparently ignoring the CharacterEncoding property (at least with respect to the issue at hand). The documentation says:
This option specifies the character encoding Tidy uses for both the input and output.
Yet it seems to have no effect, at least on this issue of characters being converted to character references when they shouldn't be. I have tested this with streams and files. I have not yet managed to get a .NET string to pass through the Document.FromString() method without this unwanted conversion.
So for files and streams, the solution is to not use CharacterEncoding, and explicitly set the InputCharacterEncoding and OutputCharacterEncoding to EncodingType.Utf8.
Short of fixing LibTidy itself, it seems we could change the TidyManaged wrapper to either drop the unnecessary and broken CharacterEncoding property, or have it explicitly set the other two.
Looking through the code, I wonder if we're asking for trouble with statements like this:
var tempEnc = this.CharacterEncoding;
What if the client instead used the InputCharacterEncoding and OutputCharacterEncoding properties? What would the value of CharacterEncoding be at this point?
In the end, I think this CharacterEncoding property muddies the semantics and leads to errors. I understand that it comes from the C DLL, but this wrapper might still be better off dropping support for it.
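To make the second option concrete (having CharacterEncoding explicitly set the other two), here is a rough sketch. DocumentSketch and the cut-down EncodingType below are hypothetical stand-ins, not the real TidyManaged types, and the Ascii defaults are arbitrary placeholders rather than libtidy's actual defaults. Making the property write-only is just one way to sidestep the question of what its getter should return when the two encodings differ.

```csharp
using System;

// Hypothetical stand-in for TidyManaged.EncodingType; only two members shown.
public enum EncodingType { Ascii, Utf8 }

// Hypothetical, simplified stand-in for the wrapper's Document class.
public class DocumentSketch
{
    // Placeholder defaults, chosen only for this sketch.
    public EncodingType InputCharacterEncoding { get; set; } = EncodingType.Ascii;
    public EncodingType OutputCharacterEncoding { get; set; } = EncodingType.Ascii;

    // Write-only convenience property: forwards to both real settings,
    // so there is no third value that can drift out of sync.
    public EncodingType CharacterEncoding
    {
        set
        {
            InputCharacterEncoding = value;
            OutputCharacterEncoding = value;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var doc = new DocumentSketch();
        doc.CharacterEncoding = EncodingType.Utf8;

        Console.WriteLine(doc.InputCharacterEncoding);  // prints Utf8
        Console.WriteLine(doc.OutputCharacterEncoding); // prints Utf8
    }
}
```

Code that needs to read an encoding back would then have to say which one it means, input or output.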
This also works for me fine:
MemoryStream str = new MemoryStream(Encoding.UTF8.GetBytes(input));
using (TidyManaged.Document doc = TidyManaged.Document.FromStream(str))
{
doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
doc.CleanAndRepair();
output = doc.Save();
}
str.Close();
IMHO the problem is in the Document.FromString() method.
I agree with hatton that Document.CharacterEncoding does nothing at all.
BTW: I think that it would be nice to have UTF8 encoding - the standard in .NET world - as default.
Is anybody able to fix those issues? They're really annoying.
So how can I use the FromStream method and get the right output?
@smirkchung - hrnr's example works for me - input and output are strings
Thanks for your help! It works for me.