html_entities
Percent HTML entity is not decoded
Expected: HtmlEntities.decode("100&percnt;") #=> "100%"
Actual: HtmlEntities.decode("100&percnt;") #=> "100&percnt;"
This seems to be a comprehensive list of HTML entities: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
I can build a mapping file, if you are willing to use it in your project.
Hi, thank you for pointing this out. The list you referenced is actually what I used to generate this file, which is then used as a source for all the function clauses to cover these named entities.
The Wikipedia page has since been updated to include entities defined in HTML 5.0, growing the list from a few hundred to a few thousand entities.
It's a reasonable addition, but I'll think about whether this can be done in a way that lets users who only need to decode old documents, from back when entities were more commonplace, keep a slimmer, more performant dependency. Functionally it's a backwards-compatible change, but there will be some cost in performance and compiled file size. At the very least I need to measure the impact on both.
Where did you find a document in the wild with HTML 5.0 entities in it? I'm a little bit surprised as I don't see good reasons to encode characters beyond the ones needed to produce html-safe text these days.
We do a lot of web scraping, and there are many weird things in the wild :)
Please note there are quite a few entities with multiple codepoints. Also, I've noticed &amp and &amp; are both valid entities, so I had to sort the entities in Util.HtmlCharref.Util.load_entities by length, longest first. Otherwise the shorter &amp would match first and "Tom &amp; Jerry" could be decoded to "Tom &; Jerry".
My quick solution to this (excerpt from your codebase):
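defmodule Util.HtmlCharref do
  def decode(text) when is_binary(text), do: decode(text, [])
  def decode(text), do: text
  # https://html.spec.whatwg.org/entities.json
  @charref_filename "./lib/util/html_charref/entities.txt"
  # Loaded at compile time; entities are sorted longest name first, so the
  # clause for "&amp;" is generated before the clause for "&amp".
  codes = Util.HtmlCharref.Util.load_entities(@charref_filename)
  # One decode/2 clause is generated per entity name.
  for {name, codepoints} <- codes do
    defp decode(<<unquote(name), rest::binary>>, acc) do
      # acc is reversed at the end, so for multi-codepoint entities the
      # codepoints must already be in reverse order here.
      decode(rest, unquote(codepoints) ++ acc)
    end
  end
  # No entity matched: copy one UTF-8 codepoint through unchanged.
  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])
  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end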
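The ordering point can be checked in isolation. This is only a minimal sketch, not the library's code: the module name OrderingSketch and the three-entry table are made up for illustration, and a real table would be loaded from a file. It reverses each codepoint list before pushing it, which is an assumption about how the table is stored.

```elixir
defmodule OrderingSketch do
  # Hypothetical hand-written sample; not the real entity table.
  entities = [{"&amp", [?&]}, {"&amp;", [?&]}, {"&lt;", [?<]}]

  # Longest names first, so "&amp;" gets its clause before "&amp".
  sorted = Enum.sort_by(entities, fn {name, _} -> byte_size(name) end, :desc)

  def decode(text) when is_binary(text), do: decode(text, [])

  # Clauses are tried in definition order, which is why the sort matters.
  for {name, codepoints} <- sorted do
    defp decode(<<unquote(name), rest::binary>>, acc) do
      decode(rest, Enum.reverse(unquote(codepoints)) ++ acc)
    end
  end

  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])
  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end

OrderingSketch.decode("Tom &amp; Jerry") #=> "Tom & Jerry"
OrderingSketch.decode("Tom &amp Jerry")  #=> "Tom & Jerry"
```

Without the sort, the "&amp" clause would come first in this sample and the first call would return "Tom &; Jerry".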
P.S. Thank you for a great lib.
Right, I noticed the footnote about which entities allow dropping the semi-colon now that I read the wiki entry more carefully. Let's open a separate issue for this. I'm currently working on creating a mix task to make it easy to generate my source file from a copy of the wikitable, and I started adding support for the [a] footnote, marking the entities in my list that allow no semi-colon.
As for entities that decode to multiple codepoints, that should be tackled in this issue, or the HTML 5 entities won't decode properly. It seems simple enough: we'll turn the codepoint part into a list and replace the entity with all of them.
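To make the list idea concrete, here is a rough sketch. MultiCodepointSketch, replacement/1 and push/2 are hypothetical names, not part of html_entities, and the table holds only three sample entries. Every entity maps to a list of codepoints, so single-codepoint and multi-codepoint entities go through the same path.

```elixir
defmodule MultiCodepointSketch do
  # Hypothetical sample table: entity name => list of codepoints.
  @entities %{
    "&percnt;" => [0x25],
    "&fjlig;" => [?f, ?j],
    "&NotEqualTilde;" => [0x2242, 0x0338]
  }

  # Returns the replacement text, or the name itself if it is unknown.
  def replacement(name) do
    case Map.fetch(@entities, name) do
      {:ok, codepoints} -> List.to_string(codepoints)
      :error -> name
    end
  end

  # For a decoder that builds its accumulator in reverse: push the
  # codepoints one by one so they come out in order after Enum.reverse/1.
  def push(acc, codepoints), do: Enum.reduce(codepoints, acc, &[&1 | &2])
end

MultiCodepointSketch.replacement("&fjlig;") #=> "fj"
```

The push/2 helper is the part that is easy to get wrong: prepending the whole list in its original order to a reversed accumulator would yield "jf" instead of "fj".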
Take a look at this file: https://html.spec.whatwg.org/entities.json It might be worth using it instead of the wiki table.
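For reference, each key in that file is the full entity name (with the leading & and, where required, the trailing ;) and each value carries a "codepoints" list and a "characters" string. A small sketch of turning it into a clause table follows. It assumes the JSON has already been decoded into a map by whatever JSON library is at hand; EntitiesJsonSketch and to_table/1 are hypothetical names.

```elixir
defmodule EntitiesJsonSketch do
  # Takes the already-decoded JSON object and returns {name, codepoints}
  # pairs, longest name first, ready for generating function clauses.
  def to_table(decoded) when is_map(decoded) do
    decoded
    |> Enum.map(fn {name, %{"codepoints" => codepoints}} -> {name, codepoints} end)
    |> Enum.sort_by(fn {name, _} -> byte_size(name) end, :desc)
  end
end

# Three entries in the shape used by entities.json.
sample = %{
  "&amp" => %{"codepoints" => [38], "characters" => "&"},
  "&amp;" => %{"codepoints" => [38], "characters" => "&"},
  "&fjlig;" => %{"codepoints" => [102, 106], "characters" => "fj"}
}

EntitiesJsonSketch.to_table(sample)
#=> [{"&fjlig;", [102, 106]}, {"&amp;", [38]}, {"&amp", [38]}]
```

Since the file lists the no-semicolon variants as separate keys, the footnote handling from the wiki table would not be needed with this source.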