Can't get entities properly output
Description
I am trying to use HAP to convert HTML to XHTML.
HTML entities are not recognized and therefore not correctly output.
For example, `&nbsp;` is output as `&amp;nbsp;`.
Am I missing something, or is this definitely a problem?
Further technical details
- HAP version: 1.11.23
- .NET version (net472, netcore, etc.): .NET Framework 4.8
VB:

    Dim doc As New HtmlAgilityPack.HtmlDocument()
    Using reader As New StreamReader(Path_PWeb_Filtered)
        Dim html As String = reader.ReadToEnd()
        doc.LoadHtml(html)
        doc.CreateNavigator()
        doc.DocumentNode.SelectSingleNode("//style").Remove()
        doc.OptionAutoCloseOnEnd = True
        doc.OptionOutputAsXml = True
        Dim settings As XmlWriterSettings = New XmlWriterSettings()
        settings.ConformanceLevel = ConformanceLevel.Auto
        Dim writer As XmlWriter = XmlWriter.Create(My.Computer.FileSystem.GetTempFileName, settings)
        doc.Save(writer)
    End Using
Hello @Bernard-Martin ,
Could you provide a project with the issue?
We ran a quick test: a space was output as a space and not as `&nbsp;`, and the `&nbsp;` was escaped as expected.
Best Regards,
Jon
Hi Jon,
Many thanks for your quick reply.
Further to my email yesterday, I ran all sorts of experiments, especially with character encodings. But no matter what I try, it seems that `doc.OptionOutputAsXml = True` is the problem.
I attach a zip file which contains my input file (MS-Word generated HTML) and the output files I get, HTML and XHTML. As you will see:
- with `doc.OptionOutputAsXml = False`, I get the correct entities (but the file is of course HTML, which is not what I want).
- with `doc.OptionOutputAsXml = True`, I get XHTML (meta, br, etc. are now self-closing), but the entities are wrong (`&amp;nbsp;`).
I can't provide you with a real project, as I am just evaluating options for now. Eventually, it will be an epub creation from Word files. I know things already exist for this, but it seems none has the flexibility that's required.
The VB code I used for testing is:

    Dim doc As New HtmlAgilityPack.HtmlDocument()
    Dim outfile As String = My.Computer.FileSystem.GetTempFileName
    Dim fs As New FileStream(outfile, FileMode.OpenOrCreate)
    Using writer As New StreamWriter(fs)
        doc.DetectEncodingAndLoad(infile)
        doc.CreateNavigator()
        doc.OptionOutputAsXml = True ' true for xhtml, false for html
        doc.DocumentNode.SelectSingleNode("//style").Remove()
        doc.Save(writer)
        writer.Close()
    End Using

I hope this helps in understanding what's happening!
Have a good day,
Bernard
Hello @Bernard-Martin ,
To whom did you send your zip file? I don't see any email from you in our inbox ([email protected]), nor a project attached to this issue.
Hello Jon, I just hit the reply button and attached the zip file to my email. Apparently it went to <[email protected]>; I did not notice.
What is the normal procedure?
Thanks,
Bernard
Send it directly here: [email protected] ;)
Oh, I see. I just edited your post because one character was hidden (the first one, which showed up as a space), and that confused me.
You are outputting XML, so it is normal that the entity gets escaped (in fact, it is the `&` that is escaped).
| Special character | Escaped form | Gets replaced by |
|---|---|---|
| Ampersand | `&amp;` | `&` |
| Less-than | `&lt;` | `<` |
| Greater-than | `&gt;` | `>` |
| Quotes | `&quot;` | `"` |
| Apostrophe | `&apos;` | `'` |
There is nothing we can do about this when you write XML: those characters must be escaped.
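
To illustrate the escaping described above, here is a minimal, language-neutral sketch in Python using only the standard library. It is not HAP's implementation, and the helper names `xml_escape_raw` and `xml_escape_decoded` are hypothetical. It shows why `&nbsp;` turns into `&amp;nbsp;` when text is XML-escaped as-is, and one possible workaround, assuming it is acceptable to decode HTML named entities into characters before writing XML.

```python
import html
from xml.sax.saxutils import escape


def xml_escape_raw(text: str) -> str:
    """Escape &, < and > for XML without interpreting HTML entities.

    An existing '&nbsp;' is treated as plain text, so its '&' is escaped.
    """
    return escape(text)


def xml_escape_decoded(text: str) -> str:
    """Decode HTML entities first, then escape for XML.

    Non-ASCII characters (such as U+00A0) are written as numeric
    character references, which are valid in XML without a DTD.
    """
    decoded = html.unescape(text)   # '&nbsp;' becomes the character U+00A0
    escaped = escape(decoded)       # re-escape &, < and >
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(xml_escape_raw("a&nbsp;b"))      # a&amp;nbsp;b
print(xml_escape_decoded("a&nbsp;b"))  # a&#160;b
```

The second form keeps the non-breaking space while staying valid XML, because `&#160;` is a numeric reference rather than an HTML-only named entity.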