Html2Markdown
Support OneNote html for bold and italic
OneNote encodes its HTML pages in a way that's close to what Html2Markdown supports, but OneNote HTML expresses bold and italics through inline styles, as follows:
Property | Example |
---|---|
font-style | style="font-style:italic" (normal or italic only) |
font-weight | style="font-weight:bold" (normal or bold only) |
strike-through | style="text-decoration:line-through" |
text-align | style="text-align:center" (for block elements only) |
text-decoration | style="text-decoration:underline" (none or underline only) |
I'm willing to make the changes if you tell me how you want me to fix it.
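To make the table concrete, here is a minimal sketch of how those style declarations might map to Markdown markers. The `StyleDecorations` class and its methods are hypothetical and not part of Html2Markdown. The marker choices (`_`, `**`, `~~`) are an assumption, and `~~` is an extension rather than vanilla Markdown. Underline and text-align are skipped because Markdown has no direct equivalent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class StyleDecorations
{
    // Assumed mapping from normalized CSS declarations to Markdown markers.
    private static readonly Dictionary<string, string> Markers = new Dictionary<string, string>
    {
        { "font-style:italic", "_" },
        { "font-weight:bold", "**" },
        { "text-decoration:line-through", "~~" },
    };

    // Returns the markers for every recognized declaration in a style attribute value.
    public static List<string> MarkersFor(string style)
    {
        if (string.IsNullOrWhiteSpace(style)) return new List<string>();
        return style
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(d => Normalize(d))
            .Where(d => Markers.ContainsKey(d))
            .Select(d => Markers[d])
            .ToList();
    }

    // Strips whitespace and lowercases, so "Font-Weight : Bold" becomes "font-weight:bold".
    private static string Normalize(string declaration)
    {
        return string.Concat(declaration.Where(c => !char.IsWhiteSpace(c))).ToLowerInvariant();
    }

    // Wraps the text in each marker, in the order the declarations appear.
    public static string Wrap(string text, string style)
    {
        foreach (string marker in MarkersFor(style))
        {
            text = marker + text + marker;
        }
        return text;
    }
}
```

For example, `StyleDecorations.Wrap("Expected Bold", "font-weight: bold")` returns `**Expected Bold**`, and an unrecognized declaration such as `text-align:center` leaves the text unchanged.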
@idvorkin - let me complete #61 first. This will make it more straightforward to implement.
@idvorkin - #61 is complete. If you want to support OneNote HTML, you will need to create a new `IScheme` implementation; you can extend `Markdown`. Let me know if that doesn't make sense, or if you need help.
I was thinking the OneNote HTML representation of font properties would apply to other tools generating HTML, so we should have it be something the default converter understands.
Based on that, I'm thinking we'd implement it by creating a new `CustomReplacer.CustomAction`, which I'd include in the `Markdown._replacers` list. Am I on the right track?
I was thinking of only implementing these font decorations when they appear in span elements (where I normally observe them). The spec says these styles can also appear in other elements, where it gets trickier to implement.
Thinking out loud, if we want to implement this for non-span elements, we can do a two-pass approach:
- Add a span element around the original element content.
- Run span replacement.
For example, imagine the following input:
`<h1 style="font-weight:bold">BLAH</h1>`
Step 1: Spanify
`<h1><span style="font-weight:bold">BLAH</span></h1>`
Step 2: Run the span transformer.
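A rough sketch of what the spanify pass (step 1) could look like with HtmlAgilityPack. The `Spanifier` class is hypothetical, and this version moves the whole `style` attribute onto the new span, which is an assumption; a real implementation would probably leave block-level properties such as `text-align` on the original element.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class Spanifier
{
    // Moves the style attribute of every non-span element onto a new <span>
    // that wraps the element's original content.
    public static void Spanify(HtmlDocument doc)
    {
        HtmlNodeCollection styled = doc.DocumentNode.SelectNodes("//*[@style]");
        if (styled == null) return; // SelectNodes returns null when nothing matches

        // Reverse document order, so nested elements are handled before their ancestors.
        foreach (HtmlNode node in styled.Where(n => n.Name != "span").Reverse().ToList())
        {
            string style = node.GetAttributeValue("style", "");
            if (string.IsNullOrWhiteSpace(style)) continue; // e.g. <td style=''>

            string inner = node.InnerHtml;

            HtmlNode span = doc.CreateElement("span");
            span.SetAttributeValue("style", style);
            span.InnerHtml = inner;

            node.RemoveAllChildren();
            node.AppendChild(span);
            node.Attributes.Remove("style");
        }
    }
}
```

After loading `<h1 style="font-weight:bold">BLAH</h1>` into an `HtmlDocument` and calling `Spanifier.Spanify(doc)`, the `h1` should no longer carry the style and should contain a single styled span, at which point the existing span replacement can run unchanged.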
@idvorkin - Please don't modify `Markdown`; that is for support of the vanilla Markdown spec. To support OneNote, create a OneNote implementation of `IScheme` extending `Markdown` as outlined. The functions for the parsing can live in either your new class or in `HtmlParser`.
As they say, weeks of coding can save hours of design :) Happy to sync in chat/voice/video if that's fastest.
I'd love to better understand your design choice. How do you decide when an HTML representation should be part of the core converter vs a different scheme? The <strong> element you mention is an excellent example. I'd expect it to map to bold in markdown.
https://gitter.im/Html2Markdown/issue-60
FYI, for the transform approach I'm thinking something like this:
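```csharp
// Requires: using System.Collections.Generic; using System.Linq;
var styleToElementName = new Dictionary<string, string>()
{
    {"font-weight:bold", "b"},
    {"font-style:italic", "i"},
};
var onenoteHTML = @"<td style=''><span style='font-weight:bold'>Expected Bold </span></td>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(onenoteHTML);
foreach (var s2e in styleToElementName)
{
    // Find spans whose style attribute is exactly this declaration.
    var styledElements = doc.DocumentNode.SelectNodes($"//span[@style='{s2e.Key}']");
    if (styledElements == null) continue; // SelectNodes returns null when nothing matches
    foreach (var element in styledElements)
    {
        // Rename the span to <b> or <i> and drop the now-redundant style attribute.
        element.Name = s2e.Value;
        element.Attributes.Where(a => a.Name == "style" && a.Value == s2e.Key).ToList()
            .ForEach(a => element.Attributes.Remove(a));
    }
}
```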
Here are OneNote fixes which work for me. I assume that table cells don't contain line breaks; otherwise this needs extra processing (replacing them with a br tag):
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using Html2Markdown.Replacement;
using Html2Markdown.Scheme;
using HtmlAgilityPack;

namespace OneSyncTool.Core
{
    class Html2MarkdownScheme : IScheme
    {
        private readonly Markdown _builtIn = new Markdown();
        private readonly List<IReplacer> _replacers;

        public Html2MarkdownScheme()
        {
            _replacers = new List<IReplacer>(_builtIn.Replacers());
            //OneNote block decoration
            _replacers.Add(new PatternReplacer("<div\\s+style\\s*=\\s*\"position:absolute(.+?)>", ""));
            _replacers.Add(new PatternReplacer("</div>", ""));
            //everything else
            _replacers.Add(new OneNoteHapReplacer());
        }

        public IList<IReplacer> Replacers() => _replacers;

        internal class OneNoteHapReplacer : IReplacer
        {
            public string Replace(string html)
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                ProcessFontStyles(doc);
                ProcessTables(doc);
                return doc.DocumentNode.OuterHtml;
            }

            private void ProcessFontStyles(HtmlDocument doc)
            {
                HtmlNodeCollection fontStyles = doc.DocumentNode.SelectNodes("//span[@style]");
                if (fontStyles == null) return; // SelectNodes returns null when nothing matches
                foreach (HtmlNode node in fontStyles)
                {
                    string style = node.GetAttributeValue("style", null);
                    if (style == null) continue;
                    string[] styles = style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim()).ToArray();
                    var decorations = new List<string>();
                    if (styles.Contains("font-style:italic")) decorations.Add("_");
                    if (styles.Contains("font-weight:bold")) decorations.Add("**");
                    if (styles.Contains("text-decoration:line-through")) decorations.Add("~~");
                    // there's no underline in markdown? ignore it for now
                    string replacement = Decorate(node.InnerHtml, decorations);
                    // replace the span with its decorated text
                    node.ParentNode.ReplaceChild(doc.CreateTextNode(replacement), node);
                }
            }

            private void ProcessTables(HtmlDocument doc)
            {
                // relative XPath: matches only tables directly under the document node
                HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("table");
                if (tables == null) return; // SelectNodes returns null when nothing matches
                foreach (HtmlNode table in tables)
                {
                    var s = new StringBuilder();
                    bool isHeader = true;
                    //there are text nodes in children, they are just line breaks and safe to ignore
                    foreach (HtmlNode row in table.ChildNodes.Where(n => n.Name == "tr"))
                    {
                        int cellCount = 0;
                        s.Append("|");
                        foreach (HtmlNode cell in row.ChildNodes.Where(n => n.Name == "td"))
                        {
                            s.Append(cell.InnerText.Trim());
                            s.Append("|");
                            cellCount++;
                        }
                        s.AppendLine();
                        if (isHeader)
                        {
                            // the first row is treated as the header, followed by the separator row
                            s.Append("|");
                            for (int i = 0; i < cellCount; i++)
                            {
                                s.Append("-|");
                            }
                            s.AppendLine();
                            isHeader = false;
                        }
                    }
                    table.ParentNode.ReplaceChild(doc.CreateTextNode(s.ToString()), table);
                }
            }

            private string Decorate(string text, IReadOnlyCollection<string> decorations)
            {
                foreach (string dec in decorations)
                {
                    // wrap the text in the marker on both sides
                    text = dec + text + dec;
                }
                return text + Environment.NewLine; //append new line because it's in a span
            }
        }

        internal class PatternReplacer : IReplacer
        {
            public PatternReplacer(string pattern, string replacement)
            {
                Pattern = pattern;
                Replacement = replacement;
            }

            public string Pattern { get; }
            public string Replacement { get; }

            public string Replace(string html)
            {
                return new Regex(Pattern).Replace(html, Replacement);
            }
        }
    }
}
```
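For anyone wanting to try it, a usage sketch. It assumes the `Converter` constructor overload that accepts an `IScheme` and that this code lives in the same assembly as the scheme above (the class is not public). The `OneNoteExportDemo` class and the sample HTML are made up for illustration.

```csharp
using System;
using Html2Markdown;

namespace OneSyncTool.Core
{
    // Hypothetical entry point, for illustration only.
    static class OneNoteExportDemo
    {
        static void Main()
        {
            // Sample input in the OneNote style described in this issue.
            string html = "<p><span style='font-weight:bold'>Expected Bold</span></p>";

            // Assumption: Converter accepts a custom IScheme in its constructor.
            var converter = new Converter(new Html2MarkdownScheme());
            Console.WriteLine(converter.Convert(html));
        }
    }
}
```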
Just to demo it, original OneNote page:
exported to markdown: