html-agility-pack

Can't get entities properly output

Open Bernard-Martin opened this issue 5 years ago • 6 comments

Description

I am trying to use HAP to convert HTML to XHTML.

HTML entities are not recognized and are therefore not output correctly. For example, &nbsp; is output as &amp;nbsp;

Am I missing something, or is it definitely a problem?

Further technical details

  • HAP version: 1.11.23
  • .NET version (net472, netcore, etc.): .NET Framework 4.8

VB:

    Dim doc As New HtmlAgilityPack.HtmlDocument()
    Using reader As New StreamReader(Path_PWeb_Filtered)
        Dim html As String = reader.ReadToEnd()
        doc.LoadHtml(html)
        doc.CreateNavigator()
        doc.DocumentNode.SelectSingleNode("//style").Remove()
        doc.OptionAutoCloseOnEnd = True
        doc.OptionOutputAsXml = True
        Dim settings As XmlWriterSettings = New XmlWriterSettings()
        settings.ConformanceLevel = ConformanceLevel.Auto
        Dim writer As XmlWriter = XmlWriter.Create(My.Computer.FileSystem.GetTempFileName, settings)
        doc.Save(writer)
    End Using

Bernard-Martin, May 13 '20 18:05

Hello @Bernard-Martin ,

Could you provide a project with the issue?

We made a quick test: a space appeared as a space and not as &nbsp;, and the &nbsp; was escaped as expected.

Best Regards,

Jon



JonathanMagnan, May 14 '20 00:05

Hi Jon,

Many thanks for your quick reply.

Further to my email yesterday, I ran all sorts of experiments, especially with character encodings. But no matter what I try, it seems that "doc.OptionOutputAsXml = True" is the problem.

I attach a zip file which contains my input file (MS-Word generated HTML) and the output files I get, html and xhtml. As you will see:

  • with "doc.OptionOutputAsXml = False", I get the correct entities (but the file is of course HTML, which is not what I want).
  • with "doc.OptionOutputAsXml = True", I get XHTML (meta, br, etc. are now self-closing), but the entities are wrong (&amp;nbsp;)

I can't provide you with a real project, as I am just evaluating options for now. Eventually, it will be an epub creation from Word files. I know things already exist for this, but it seems none has the flexibility that's required.

The VB code I used for testing is:

    Dim doc As New HtmlAgilityPack.HtmlDocument()
    Dim outfile As String = My.Computer.FileSystem.GetTempFileName
    Dim fs As New FileStream(outfile, FileMode.OpenOrCreate)
    Using writer As New StreamWriter(fs)
        doc.DetectEncodingAndLoad(infile)
        doc.CreateNavigator()
        doc.OptionOutputAsXml = True ' true for xhtml, false for html
        doc.DocumentNode.SelectSingleNode("//style").Remove()
        doc.Save(writer)
        writer.Close()
    End Using

I hope this helps in understanding what's happening!

Have a good day,

Bernard


Bernard-Martin, May 14 '20 08:05

Hello @Bernard-Martin ,

To whom did you send your zip file? I don't see any email from you in our inbox ([email protected]), nor a project attached to this issue.

JonathanMagnan, May 14 '20 12:05

Hello Jon, I just hit the reply button and attached the zip file to my email. Apparently it went to <[email protected]>; I did not notice.

What is the normal procedure?

Thanks,

Bernard


Bernard-Martin, May 14 '20 13:05

Send it directly here: [email protected] ;)

JonathanMagnan, May 14 '20 17:05

Oh I see. I just edited your post because some characters were hidden (the first space), which confused me.

You are outputting XML, so it is normal that &nbsp; gets escaped (in fact, it is the & that is escaped).

Special character   Escaped form   Gets replaced by
Ampersand           &amp;          &
Less-than           &lt;           <
Greater-than        &gt;           >
Quotes              &quot;         "
Apostrophe          &apos;         '

We cannot do anything about this: if you write XML, those characters must be escaped.

JonathanMagnan, May 15 '20 02:05
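
For readers who land here with the same problem, here is a minimal sketch of the escaping described above. It uses only .NET base class library calls, not HAP itself. It assumes that the entity name reaches the XML writer as literal text, which is what the escaped output suggests. The module and function names (EntityEscapingSketch, WriteParagraph) are hypothetical. The first call shows the ampersand being escaped. The second shows one possible workaround: decoding the entity to the actual character with WebUtility.HtmlDecode before the text is written. Whether such a decoding step fits into a particular HAP pipeline is not covered in this thread.

```vb
Imports System
Imports System.IO
Imports System.Net
Imports System.Xml

Module EntityEscapingSketch
    ' Hypothetical helper: serializes a single <p> element whose text
    ' content is the given string, and returns the resulting XML.
    Function WriteParagraph(text As String) As String
        Dim settings As New XmlWriterSettings() With {.OmitXmlDeclaration = True}
        Using sw As New StringWriter()
            Using writer As XmlWriter = XmlWriter.Create(sw, settings)
                ' The writer escapes "&" in text content, as XML requires.
                writer.WriteElementString("p", text)
            End Using
            Return sw.ToString()
        End Using
    End Function

    Sub Main()
        Dim raw As String = "a&nbsp;b"

        ' The entity name is treated as literal text, so its "&" is escaped.
        Console.WriteLine(WriteParagraph(raw)) ' <p>a&amp;nbsp;b</p>

        ' Assumed workaround: decode the HTML entity to the real character
        ' (U+00A0, no-break space) before handing the text to the writer.
        Dim decoded As String = WebUtility.HtmlDecode(raw)
        Console.WriteLine(decoded.IndexOf(ChrW(160)) >= 0) ' True
        Console.WriteLine(WriteParagraph(decoded).Contains("&amp;")) ' False
    End Sub
End Module
```

XML predefines only the five entities listed in the table above. A named entity such as &nbsp; is only available when a DTD declares it, whereas the U+00A0 character itself or the numeric reference &#160; is valid in any XML document.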