
Support OneNote html for bold and italic

Open idvorkin opened this issue 7 years ago • 10 comments

OneNote encodes its HTML pages in a way that's close to what Html2Markdown supports, but OneNote HTML expresses bold and italics through inline styles, as follows:

| Property | Example |
| --- | --- |
| font-style | style="font-style:italic" (normal or italic only) |
| font-weight | style="font-weight:bold" (normal or bold only) |
| strike-through | style="text-decoration:line-through" |
| text-align | style="text-align:center" (for block elements only) |
| text-decoration | style="text-decoration:underline" (none or underline only) |
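
To make the mapping concrete, here is a minimal sketch of how such a style value could be turned into Markdown delimiters. It uses only the .NET standard library. `StyleSketch`, `MarkdownDelimiters` and `Wrap` are hypothetical names for illustration, not part of Html2Markdown. It also assumes `~~` strikethrough is acceptable even though that is a GitHub-flavoured extension rather than vanilla Markdown, and it ignores text-align and underline.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StyleSketch
{
    // Hypothetical helper: maps an inline style value to Markdown delimiters.
    public static IReadOnlyList<string> MarkdownDelimiters(string style)
    {
        // Split "a:b; c:d" into declarations, dropping whitespace and case differences.
        var declarations = (style ?? string.Empty)
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(d => string.Concat(d.Where(c => !char.IsWhiteSpace(c))).ToLowerInvariant())
            .ToList();

        var delimiters = new List<string>();
        if (declarations.Contains("font-weight:bold")) delimiters.Add("**");
        if (declarations.Contains("font-style:italic")) delimiters.Add("_");
        // Assumption: "~~" is supported by the target Markdown flavour.
        if (declarations.Contains("text-decoration:line-through")) delimiters.Add("~~");
        return delimiters;
    }

    // Wraps the text in each delimiter in turn; the first delimiter ends up innermost.
    public static string Wrap(string text, IEnumerable<string> delimiters) =>
        delimiters.Aggregate(text, (acc, d) => d + acc + d);
}
```

For example, `StyleSketch.Wrap("Hi", StyleSketch.MarkdownDelimiters("font-weight:bold; font-style:italic"))` returns `_**Hi**_`.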

I'm willing to make the changes if you tell me how you'd like it fixed.

idvorkin avatar Sep 16 '17 13:09 idvorkin

@idvorkin - let me complete #61 first. This will make it more straightforward to implement.

baynezy avatar Sep 17 '17 04:09 baynezy

@idvorkin - #61 is complete. To support OneNote HTML, you will need to create a new IScheme implementation; you can extend Markdown. Let me know if that doesn't make sense or if you need help.

baynezy avatar Sep 17 '17 06:09 baynezy

I was thinking the OneNote HTML representation of font properties would apply to other tools generating HTML, so we should have it be something the default converter understands.

Based on that, I'm thinking we'd implement it by creating a new CustomReplacer.CustomAction, which I'd include in the Markdown._replacers list. Am I on the right track?

idvorkin avatar Sep 17 '17 18:09 idvorkin

I was thinking of only implementing these font decorations when they appear in span elements (where I normally observe them). The spec says these styles can also appear in other elements, where it gets trickier to implement.

Thinking out loud: if we want to implement this for non-span elements, we can take a two-pass approach:

  1. Add a span element around the original element content.
  2. Run span replacement.

For example, imagine the following input:

  <h1 style="font-weight:bold"> BLAH </h1>

Step 1: Spanify

  <h1> <span style="font-weight:bold"> BLAH </span> </h1>

Step 2: Run span transformer.
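
The first pass could be sketched roughly as below, assuming HtmlAgilityPack is available. `SpanifySketch` and `Spanify` are hypothetical names, not existing Html2Markdown code. The sketch moves every style onto an inner span regardless of property, which is a simplification: block-only properties such as text-align would need separate handling.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class SpanifySketch
{
    // Pass 1: move the style of every styled non-span element onto a new inner span,
    // so that a span-only replacer (pass 2) can handle it afterwards.
    public static string Spanify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var styled = doc.DocumentNode.SelectNodes("//*[@style]");
        if (styled == null) return html;

        var targets = styled
            .Where(n => n.Name != "span"
                        && !string.IsNullOrWhiteSpace(n.GetAttributeValue("style", "")))
            .ToList();

        foreach (var element in targets)
        {
            var span = doc.CreateElement("span");
            span.SetAttributeValue("style", element.GetAttributeValue("style", ""));

            // Re-parent the existing child nodes (not copies) under the new span,
            // so nested styled elements are still reachable later in the loop.
            var children = element.ChildNodes.ToList();
            element.RemoveAllChildren();
            foreach (var child in children) span.AppendChild(child);

            element.Attributes.Remove("style");
            element.AppendChild(span);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `SpanifySketch.Spanify` on the `<h1>` example above should give an `<h1>` without a style attribute wrapping a single styled `<span>`, which is what Step 2 expects.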

idvorkin avatar Sep 17 '17 18:09 idvorkin

@idvorkin - Please don't modify Markdown; that class exists to support the vanilla Markdown spec. To support OneNote, create a OneNote implementation of IScheme that extends Markdown, as outlined. The parsing functions can live either in your new class or in HtmlParser.

baynezy avatar Sep 19 '17 10:09 baynezy

As they say, weeks of coding can save hours of design :) Happy to sync over chat/voice/video if that's fastest.

I'd love to better understand your design choice. How do you decide when an HTML representation should be part of the core converter vs a different scheme? The <strong> element you mention is an excellent example. I'd expect it to map to bold in markdown.

idvorkin avatar Sep 19 '17 14:09 idvorkin

https://gitter.im/Html2Markdown/issue-60

baynezy avatar Sep 23 '17 16:09 baynezy

FYI, for the transform approach I'm thinking something like this:

using System.Collections.Generic;

// Maps an exact OneNote style value to the HTML element that replaces the span.
var styleToElementName = new Dictionary<string, string>()
{
	{"font-weight:bold","b"},
	{"font-style:italic","i"},	// OneNote emits "italic", not "italics"
};

var onenoteHTML = @"<td style=''><span style='font-weight:bold'>Expected Bold </span></td>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(onenoteHTML);

foreach (var s2e in styleToElementName)
{
	// SelectNodes returns null (not an empty collection) when nothing matches.
	var styledElements = doc.DocumentNode.SelectNodes($"//span[@style='{s2e.Key}']");
	if (styledElements == null) continue;

	foreach (var element in styledElements)
	{
		// Rename the span to <b>/<i> and drop the now-redundant style attribute.
		element.Name = s2e.Value;
		element.Attributes.Remove("style");
	}
}

idvorkin avatar Sep 24 '17 14:09 idvorkin

Here are the OneNote fixes that work for me. I assume that tables don't contain line breaks; otherwise this needs extra processing (replacing them with a br tag):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using Html2Markdown.Replacement;
using Html2Markdown.Scheme;
using HtmlAgilityPack;

namespace OneSyncTool.Core
{
   class Html2MarkdownScheme : IScheme
   {
      private readonly Markdown _builtIn = new Markdown();
      private readonly List<IReplacer> _replacers;

      public Html2MarkdownScheme()
      {
         _replacers = new List<IReplacer>(_builtIn.Replacers());

         //OneNote block decoration
         _replacers.Add(new PatternReplacer("<div\\s+style\\s*=\\s*\"position:absolute(.+?)>", ""));
         _replacers.Add(new PatternReplacer("</div>", ""));

         //everything else
         _replacers.Add(new OneNoteHapReplacer());
      }

      public IList<IReplacer> Replacers() => _replacers;

      internal class OneNoteHapReplacer : IReplacer
      {
         public string Replace(string html)
         {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            ProcessFontStyles(doc);
            ProcessTables(doc);

            return doc.DocumentNode.OuterHtml;
         }

         private void ProcessFontStyles(HtmlDocument doc)
         {
            HtmlNodeCollection fontStyles = doc.DocumentNode.SelectNodes("//span[@style]");
            if (fontStyles == null) return; // SelectNodes returns null when nothing matches
            foreach (HtmlNode node in fontStyles)
            {
               string style = node.GetAttributeValue("style", null);
               if (style == null) continue;

               string[] styles = style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim()).ToArray();
               var decorations = new List<string>();
               if (styles.Contains("font-style:italic")) decorations.Add("_");
               if (styles.Contains("font-weight:bold")) decorations.Add("**");
               if (styles.Contains("text-decoration:line-through")) decorations.Add("~~");
               // there's no underline in markdown? ignore it for now

               string replacement = Decorate(node.InnerHtml, decorations);

               node.ParentNode.ReplaceChild(doc.CreateTextNode(replacement), node);
            }
         }

         private void ProcessTables(HtmlDocument doc)
         {
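            // "table" matches only tables that are direct children of the document node
            HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("table");
            if (tables == null) return; // SelectNodes returns null when nothing matches
            foreach(HtmlNode table in tables)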
            {
               var s = new StringBuilder();
               bool isHeader = true;

               //there are text nodes in children, they are just line breaks and safe to ignore
               foreach(HtmlNode row in table.ChildNodes.Where(n => n.Name == "tr"))
               {
                  int cellCount = 0;
                  s.Append("|");
                  foreach(HtmlNode cell in row.ChildNodes.Where(n => n.Name == "td"))
                  {
                     s.Append(cell.InnerText.Trim());
                     s.Append("|");
                     cellCount++;
                  }
                  s.AppendLine();

                  if(isHeader)
                  {
                     s.Append("|");
                     for(int i = 0; i < cellCount; i++)
                     {
                        s.Append("-|");
                     }
                     s.AppendLine();
                     isHeader = false;
                  }
               }

               table.ParentNode.ReplaceChild(doc.CreateTextNode(s.ToString()), table);
            }
         }

         private string Decorate(string text, IReadOnlyCollection<string> decorations)
         {
            foreach(string dec in decorations)
            {
               text = dec + text + dec;
            }

            return text + Environment.NewLine; //append new line because it's in a span
         }
      }

      internal class PatternReplacer : IReplacer
      {
         public PatternReplacer(string pattern, string replacement)
         {
            Pattern = pattern;
            Replacement = replacement;
         }

         public string Pattern { get; }

         public string Replacement { get; }

         public string Replace(string html)
         {
            return new Regex(Pattern).Replace(html, Replacement);
         }
      }
   }
}
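
The line-break caveat mentioned before the code could be handled with a small helper applied to each cell's text before it is appended. This is only a sketch under assumptions: `TableCellSketch` and `NormalizeCell` are hypothetical names, it assumes the Markdown renderer accepts inline `<br>` inside table cells, and it also escapes pipes, which the code above does not address.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableCellSketch
{
    // Hypothetical helper: makes cell text safe to place in a single pipe-table row.
    public static string NormalizeCell(string cellText)
    {
        string text = (cellText ?? string.Empty).Trim();

        // A bare pipe would be read as a cell boundary, so escape it.
        text = text.Replace("|", "\\|");

        // A raw line break would end the table row, so replace it with a br tag.
        return Regex.Replace(text, @"\r\n|\r|\n", "<br>");
    }
}
```

In ProcessTables this would replace `cell.InnerText.Trim()` with `TableCellSketch.NormalizeCell(cell.InnerText)`.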

aloneguid avatar Jan 09 '19 12:01 aloneguid

Just to demo it, original onenote page:

image

exported to markdown:

image

aloneguid avatar Jan 09 '19 12:01 aloneguid