Problems with encoding

Open stebrennancode opened this issue 13 years ago • 9 comments

Hi, I'm having problems using the wrapper with UTF-8 strings. For example, "é" is replaced by �.

I was just wondering if I'm doing anything fundamentally wrong? I've tried setting Input and Output character encoding values as well. Any help would be appreciated. Example code below.

Many thanks

using (TidyManaged.Document doc = TidyManaged.Document.FromString(myInput))
{
    doc.OutputXhtml = true;
    doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.CleanAndRepair();
    myOutput = doc.Save();
}

stebrennancode • Mar 21 '11 14:03

Hi there - can you provide a sample HTML document? My (pretty rudimentary) testing seems to work fine with the accented character.

markbeaton avatar Jun 10 '11 01:06 markbeaton

Hi, I had a similar problem. The characters ěščřžýáíéúů were replaced by � after running the text through the parser. The text came from a database, where it was stored with windows-1250 encoding. What I ended up with (after half a day of � spam) was this solution.

        // Encode str as windows-1250 bytes (its original encoding), convert those
        // bytes to UTF-8 (the encoding we want to use for parsing), and wrap them
        // in a stream so that .NET does not meddle with the text on the way in.
        Encoding srcEncoding = Encoding.GetEncoding("windows-1250");
        byte[] srcEncodingBytes = srcEncoding.GetBytes(str);
        Encoding destEncoding = Encoding.UTF8;
        byte[] destEncodingBytes = Encoding.Convert(srcEncoding, destEncoding, srcEncodingBytes);
        var strStream = new MemoryStream(destEncodingBytes);

        // Do the parsing with the input and output encodings set explicitly.
        // The using block disposes the document once the result is saved.
        using (var doc = TidyManaged.Document.FromStream(strStream))
        {
            doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
            doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
            doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
            doc.ShowWarnings = false;
            doc.Quiet = true;
            doc.OutputXhtml = true;
            doc.CleanAndRepair();
            str = doc.Save();
        }

There probably is a more elegant solution, but this works for me.
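
For readers wondering where the � comes from in the first place, here is a minimal sketch of one general way it can arise: bytes written in one encoding are decoded as another. It uses only the .NET `System.Text.Encoding` API and does not call TidyManaged, so it is not a claim about what the wrapper does internally. The class and method names are hypothetical, and ISO-8859-1 stands in for windows-1250 as an assumption, because windows-1250 needs an extra code page provider on newer .NET runtimes.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only; not part of TidyManaged.
public static class MojibakeDemo
{
    // Encodes the text with one encoding and decodes the resulting bytes
    // with another, mimicking a writer and a reader that disagree.
    public static string Reinterpret(string text, Encoding writtenAs, Encoding readAs)
    {
        byte[] bytes = writtenAs.GetBytes(text);
        return readAs.GetString(bytes);
    }
}
```

Calling `MojibakeDemo.Reinterpret("é", Encoding.GetEncoding("iso-8859-1"), Encoding.UTF8)` yields a string containing U+FFFD (�), because the single byte 0xE9 is not a valid UTF-8 sequence. When both sides use UTF-8, the text round-trips unchanged.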

RadekMlada • Aug 12 '11 21:08

Here's a unit test to demonstrate what I assume is the same problem:

[Test]
public void RoundTripsUtf8File()
{
    // ŋ (velar nasal, U+014B) --> &#331;
    // β (Greek beta, U+03B2) --> &#946;
    using (var input = TempFile.CreateAndGetPathButDontMakeTheFile())
    {
        var source = "<!DOCTYPE html><html><head> <meta charset='UTF-8'></head><body>ŋ β</body></html>";
        File.WriteAllText(input.Path, source, Encoding.UTF8);
        using (var tidy = TidyManaged.Document.FromFile(input.Path))
        {
            tidy.CharacterEncoding = EncodingType.Utf8; //tried Raw, too
            tidy.CleanAndRepair();
            using (var output = new TempFile())
            {
                tidy.Save(output.Path);
                var newContents = File.ReadAllText(output.Path);
                Assert.IsTrue(newContents.Contains("ŋ"), newContents);
            }
        }
    }
}

This outputs:

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 October 2008), see www.w3.org">
<meta charset='UTF-8'>
<title></title>
</head>
<body>
&#331; &#946;
</body>
</html>
Expected: True
But was:  False

In contrast, from the command line,

tidy -utf8 test.htm

does what is expected: the two characters emerge in the same form in which they went in.
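
The output above is the same text written as numeric character references (331 is U+014B, 946 is U+03B2), so a test that compares raw strings fails even though a browser would render the same characters. A minimal sketch of comparing text fragments after decoding the references, using `System.Net.WebUtility.HtmlDecode` from the .NET base class library; the class and method names are hypothetical and not part of TidyManaged:

```csharp
using System;
using System.Net;

// Hypothetical helper, for illustration only; not part of TidyManaged.
public static class NcrCompare
{
    // Decodes HTML character references such as "&#331;" into the
    // characters they stand for.
    public static string DecodeReferences(string fragment)
    {
        return WebUtility.HtmlDecode(fragment);
    }

    // Returns true when two text fragments are identical once their
    // character references have been decoded.
    public static bool SameText(string a, string b)
    {
        return string.Equals(DecodeReferences(a), DecodeReferences(b), StringComparison.Ordinal);
    }
}
```

This is meant for text fragments such as the body content, not whole documents, since `HtmlDecode` also turns `&lt;` and `&amp;` into markup characters. It only helps with diagnosing the difference; it does not make Tidy emit the raw characters.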

hatton • Dec 03 '11 18:12

After more experimenting, and trying to figure out why Radek's code works, I found that libtidy is apparently ignoring the CharacterEncoding property (at least with respect to the issue at hand). The documentation says:

This option specifies the character encoding Tidy uses for both the input and output.

Yet it seems to have no effect, at least on this issue of characters being converted to numeric character references when they shouldn't be. I have tested this with streams and files. I have not managed to get a .NET string through the Document.FromString() method without this unwanted conversion.

So for files and streams, the solution is to not use CharacterEncoding, and explicitly set the InputCharacterEncoding and OutputCharacterEncoding to EncodingType.Utf8.

Short of fixing LibTidy itself, it seems we could change the TidyManaged wrapper to either drop the unnecessary and broken CharacterEncoding property, or have it explicitly set the other two.

hatton • Dec 04 '11 14:12

Looking through the code, I wonder if we're asking for trouble with statements like this:

var tempEnc = this.CharacterEncoding;

What if the client instead used the InputCharacterEncoding and OutputCharacterEncoding properties? What would the value of CharacterEncoding be at this point?

In the end, I think this CharacterEncoding property muddies the semantics and leads to errors. I understand that it comes from the C DLL, but this wrapper might still be better off dropping support for it.
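
To make the alternative concrete, here is a minimal sketch of a combined property that forwards to the two specific ones, so it can never disagree with them. Everything in it is hypothetical: `EncodingKind` is a stand-in for TidyManaged.EncodingType, `EncodingOptions` is not a TidyManaged class, the default values are arbitrary, and nothing here talks to libtidy.

```csharp
using System;

// Hypothetical stand-in for TidyManaged.EncodingType.
public enum EncodingKind { Raw, Utf8 }

// Hypothetical options holder; not the TidyManaged Document class.
public sealed class EncodingOptions
{
    // Arbitrary defaults chosen for the sketch, not Tidy's real defaults.
    public EncodingKind InputCharacterEncoding { get; set; } = EncodingKind.Raw;
    public EncodingKind OutputCharacterEncoding { get; set; } = EncodingKind.Raw;

    // Setting the combined value sets both specific values. Reading it
    // returns null when input and output differ, which makes the ambiguous
    // case explicit instead of returning a stale value.
    public EncodingKind? CharacterEncoding
    {
        get
        {
            return InputCharacterEncoding == OutputCharacterEncoding
                ? InputCharacterEncoding
                : (EncodingKind?)null;
        }
        set
        {
            if (value == null) throw new ArgumentNullException(nameof(value));
            InputCharacterEncoding = value.Value;
            OutputCharacterEncoding = value.Value;
        }
    }
}
```

With this shape, code that reads CharacterEncoding has to handle the "input and output differ" case explicitly, which is the situation the question above is worried about.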

hatton • Dec 05 '11 14:12

This also works fine for me:

MemoryStream str = new MemoryStream(Encoding.UTF8.GetBytes(input));
using (TidyManaged.Document doc = TidyManaged.Document.FromStream(str))
{
    doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.CleanAndRepair();
    output = doc.Save();
}
str.Close();

IMHO the problem is in the Document.FromString() method. I agree with hatton that Document.CharacterEncoding does nothing at all.

BTW: I think it would be nice to have UTF-8 encoding, the standard in the .NET world, as the default.

Is anybody able to fix those issues? They're really annoying.

hrnr • Jan 10 '12 16:01

So how can I use the FromStream method and get the right output?

smirkchung • Feb 09 '15 17:02

@smirkchung - hrnr's example works for me - input and output are strings

rangler2 • Jan 05 '16 15:01

Thanks for your help! It works for me.

bao-vn • Nov 13 '18 16:11