Implement HTML escaping for arbitrary string input

Open guidedways opened this issue 7 years ago • 6 comments

This looks like a powerful library for navigating HTML nodes. However, what would be the simplest way to obtain cleaned-up 'plain text' from HTML input? I'd like it to preserve any 'invalid' non-HTML tags such as John Do <[email protected]> rather than trying to parse them, as NSAttributedString's initWithHTML does.

guidedways avatar May 10 '18 18:05 guidedways

Okay, the following seems to fail:

let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "This is an <b>email</b>: John Do <[email protected]>"
print("\(element.textContent)")

outputs: This is an email: John Do

What do I have to do to make this work so that it ignores anything that doesn't look like HTML?

guidedways avatar May 10 '18 18:05 guidedways

@guidedways Hey there. Let me see if I understood you correctly.

You want to input an HTML string and have all HTML tags stripped, so that This is an <b>email</b>: John Do <[email protected]> returns This is an email: John Do <[email protected]>?

If so, the easiest way is to escape all HTML reserved characters so that they are not interpreted as HTML. In your case:

let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "This is an <b>email</b>: John Do &lt;[email protected]&gt;"
print("\(element.textContent)")
// This is an email: John Do <[email protected]>
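When the text is known to be plain text rather than markup, the escaping step described above can be done in code before assigning to innerHTML. The following is a minimal sketch, not something HTMLKit provides: `escapeHTML` is a hypothetical helper name, and the set of replaced characters is the usual one for HTML text and attribute values.

```swift
/// Replaces the characters that HTML treats as markup with character references.
/// `escapeHTML` is a hypothetical helper, not part of HTMLKit.
func escapeHTML(_ input: String) -> String {
    var result = ""
    for character in input {
        switch character {
        case "&": result += "&amp;"   // must be escaped so existing "&" is not read as an entity
        case "<": result += "&lt;"
        case ">": result += "&gt;"
        case "\"": result += "&quot;"
        case "'": result += "&#39;"
        default: result.append(character)
        }
    }
    return result
}

let escaped = escapeHTML("It could be anything <some strange non-html tag>")
print(escaped)
// It could be anything &lt;some strange non-html tag&gt;
```

Note that this escapes everything, including real tags such as <b>, so it only fits the parts of the input that should never be treated as HTML.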

Some Details

innerHTML in HTMLKit behaves as it would in a browser: it sets the HTML content of an element to the string that is passed. The string is interpreted as an HTML fragment and parsed with the element as its context.

What does this mean? Your input gets parsed into this DOM:

<div>This is an <b>email</b>: John Do <[email protected]></[email protected]></div>

Take a look here for more info: MDN Element.innerHTML

Does this answer your question? Do you have any follow-up questions?

iabudiab avatar May 10 '18 20:05 iabudiab

Yes, that is the output I'm after, but I'm not in control of the string received from the user. It could be anything, e.g. <some strange non-html tag>. I need the library to do this for me, i.e. to escape < as &lt;. Can HTMLKit find and escape non-HTML 'tags' for me?

guidedways avatar May 10 '18 20:05 guidedways

I should explain. I'm receiving input directly from the user as notes. The notes could be actual HTML or could be partial / invalid HTML. There's no way to tell, since users are free to type in whatever they wish. I need to parse the HTML and extract a plain-text version of whatever they entered, while retaining any odd entries, links, etc. that weren't entered as HTML.

guidedways avatar May 10 '18 20:05 guidedways
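One way to approximate the requirement above (keep real HTML, preserve anything that only looks like a tag) is a pre-processing pass that runs before the string is handed to innerHTML. The sketch below is a heuristic under stated assumptions, not an HTMLKit feature: `knownTags` is a hypothetical allow-list and `escapeUnknownTags` is a hypothetical helper. It escapes every < that does not start an opening or closing tag from the allow-list, so comments, doctypes and any tag missing from the list are treated as text.

```swift
// Hypothetical allow-list; extend it to cover the tags you expect in user notes.
let knownTags: Set<String> = ["a", "b", "i", "u", "p", "br", "div", "span",
                              "em", "strong", "ul", "ol", "li"]

/// Escapes every "<" that does not begin a tag whose name is in `knownTags`.
/// Hypothetical helper, not part of HTMLKit.
func escapeUnknownTags(_ input: String) -> String {
    var output = ""
    var index = input.startIndex
    while index < input.endIndex {
        let character = input[index]
        if character == "<" {
            // Skip an optional "/" so that closing tags are recognised too.
            var nameStart = input.index(after: index)
            if nameStart < input.endIndex, input[nameStart] == "/" {
                nameStart = input.index(after: nameStart)
            }
            // Collect the run of ASCII letters and digits that forms the tag name.
            var nameEnd = nameStart
            while nameEnd < input.endIndex, input[nameEnd].isASCII,
                  input[nameEnd].isLetter || input[nameEnd].isNumber {
                nameEnd = input.index(after: nameEnd)
            }
            let name = input[nameStart..<nameEnd].lowercased()
            // A tag name must be followed by ">", "/" or whitespace to count as a tag.
            let terminated = nameEnd < input.endIndex &&
                (input[nameEnd] == ">" || input[nameEnd] == "/" || input[nameEnd].isWhitespace)
            if knownTags.contains(name) && terminated {
                output.append("<")
            } else {
                output += "&lt;"
            }
        } else {
            output.append(character)
        }
        index = input.index(after: index)
    }
    return output
}

print(escapeUnknownTags("It could be <b>anything</b> <some strange non-html tag>"))
// It could be <b>anything</b> &lt;some strange non-html tag>
```

The closing > is left alone because an HTML parser treats a lone > in text as a literal character. The result can then be assigned to innerHTML and read back through textContent as in the examples above. An allow-list errs on the side of keeping user text; a tag the list does not know about ends up in the plain text instead of being stripped.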

@guidedways I see. Currently HTMLKit does not provide this functionality. I'll see if I can implement it in the next couple of days and will let you know as soon as I have something.

I'll rename the issue and mark it as a feature request.

iabudiab avatar May 10 '18 20:05 iabudiab

Thank you, that would be extremely helpful!

guidedways avatar May 10 '18 20:05 guidedways