Create a built in HTML parser

Open philss opened this issue 10 years ago • 16 comments

Floki needs a built-in HTML parser in order to remove the mochiweb dependency. This will enable more flexibility and better control of the parsing step.

The parser goals are:

  • [ ] support HTML5;
  • [ ] support HTML snippets;
  • [ ] be able to parse large files, like 15MB;
  • [ ] easy to traverse;
  • [ ] be a bit tolerant of errors, like missing closing tags.

philss avatar Oct 29 '15 04:10 philss

Here is a test case with an example of an error that Floki does not handle today: https://github.com/henrik/sipper/commit/49a4c09afa8773f9253401608f89c8d1545124cf

Thanks @henrik for the example!

philss avatar Dec 09 '15 19:12 philss

@philss creating an html parser from scratch sounds like a huge amount of work. Have you thought about depending on a C library instead, such as this one https://github.com/google/gumbo-parser?

gmile avatar Jun 07 '16 17:06 gmile

@gmile yeah, I thought about that, but I don't want to rely on an external dependency. This came from some frustration with the Nokogiri Ruby gem. It uses libxml2 and FFI to make the bridge. It failed to compile for me so many times that I didn't like the experience.

But, this is not discarded. I also think Servo's HTML parser is a good option.

philss avatar Jun 09 '16 15:06 philss

> But, this is not discarded

@philss that said, are you specifically looking toward Servo's HTML implementation? Otherwise, I could play with gumbo-parser integration and see how it goes.

gmile avatar Jun 09 '16 15:06 gmile

@gmile I'm not looking into this right now. So, please go for it. 👍

philss avatar Jun 09 '16 15:06 philss

I was wondering what the expected behavior of a native html parser would be. Right now mochiweb_html.parse always returns empty lists in either the middle or the end (depending on what level of nesting the html has). I'm not sure if this is a bug or feature but it was confusing when I first started using the library because I was hoping for some kind of "to_hash" like function in ruby.

iex(33)> htm = """
...(33)> <ul>
...(33)> <li>fooo</li>
...(33)> <li>bar</li>
...(33)> </ul>
...(33)> """
"<ul>\n<li>fooo</li>\n<li>bar</li>\n</ul>\n"
iex(34)> :mochiweb_html.parse(htm)
{"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}

Would a replacement function recreate this behavior for backwards compatibility or break the api?
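
For readers with the same question: in the `{tag, attributes, children}` tuples shown above, the second element is the attribute list, which is empty whenever a tag has no attributes. A "to_hash"-like view can be built on top of that tree. The following is only a minimal sketch; `TreeToMap` is a hypothetical helper, not part of Floki or mochiweb, and it handles only elements, comments, and text nodes.

```elixir
defmodule TreeToMap do
  # Hypothetical helper (an assumption for illustration, not a Floki API).
  # Converts a {tag, attributes, children} tuple into nested maps.
  def convert({tag, attrs, children}) when is_binary(tag) and is_list(children) do
    %{
      tag: tag,
      # Attributes arrive as a list of {name, value} tuples.
      attributes: Map.new(attrs),
      children: Enum.map(children, &convert/1)
    }
  end

  # Comment nodes are represented as {:comment, text}.
  def convert({:comment, text}), do: %{comment: text}

  # Text nodes are plain binaries and are kept unchanged.
  def convert(text) when is_binary(text), do: text
end

tree = {"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}
IO.inspect(TreeToMap.convert(tree))
```

Other node kinds that a parser may emit (doctype, processing instructions) would need their own clauses.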

BTW, thanks for the awesome library!

baron avatar Jul 12 '16 11:07 baron

It would be awesome to have something like this:

%Floki.Leaf.Comment{content: "comment content"}
%Floki.Leaf.Node{attributes: [], children: [], events: [], name: "p", styles: []}
# events and styles are optional (I was thinking about something like a browser inspector)
%Floki.Leaf.TextNode{content: "content"}

instead of:

{"p", [], []}
"content"
{comment: "content"}
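
To make the comparison concrete, here is a rough sketch of how the tuple form could be mapped onto structs like the ones proposed. The `Sketch.Leaf.*` modules are hypothetical stand-ins that mirror the proposed names; they are not existing Floki modules, and the optional `events` and `styles` fields are left out.

```elixir
# All modules below are assumptions for illustration only.
defmodule Sketch.Leaf.Comment do
  defstruct content: ""
end

defmodule Sketch.Leaf.TextNode do
  defstruct content: ""
end

defmodule Sketch.Leaf.Node do
  defstruct name: nil, attributes: [], children: []
end

defmodule Sketch.Leaf do
  alias Sketch.Leaf.{Comment, Node, TextNode}

  # {:comment, text} becomes a Comment struct.
  def from_tuple({:comment, content}), do: %Comment{content: content}

  # Plain binaries are text nodes.
  def from_tuple(text) when is_binary(text), do: %TextNode{content: text}

  # {name, attributes, children} becomes a Node struct, converted recursively.
  def from_tuple({name, attributes, children}) do
    %Node{
      name: name,
      attributes: attributes,
      children: Enum.map(children, &from_tuple/1)
    }
  end
end

IO.inspect(Sketch.Leaf.from_tuple({"p", [], ["content", {:comment, "content"}]}))
```

Structs make pattern matching on node kinds explicit, at the cost of breaking code that matches on the current tuples.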

I was also thinking about:

Floki.DocType.parse() # returns struct like:
%Floki.Document.HTML5{dom_tree: nil, lang: "en"}
Floki.DocumentParser # protocol for document structs

Features:

  • [ ] support all CSS3 (CSS4?) selectors
  • [ ] support XPath
  • [ ] log warnings when parsing + add option to raise on warning
  • [ ] add option to strip blank text node (default false)
  • [ ] add option to strip comment content (default true)
  • [ ] use Stream when possible
  • [ ] tag names and attribute names are always lower case like: "my-custom-tag" and "my-custom-data"
  • [ ] support encoding detection
  • [ ] allow validation only
  • [ ] support fetching parent(s) and sibling(s) from leaf struct ...
  • [ ] debug logs - for example: "missing title", "missing favicon" ...
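
Three of the items above (blank text stripping, comment stripping, lower-case names) can be illustrated on the existing tuple tree. This is a sketch under assumptions: `TreeOptions`, `:strip_blank_text`, and `:strip_comments` are made-up names, and "strip comment content" is interpreted here as dropping comment nodes entirely.

```elixir
defmodule TreeOptions do
  # Hypothetical post-processing pass over a list of parsed nodes.
  # Option names and defaults follow the feature list, but are assumptions.
  def clean(nodes, opts \\ []) when is_list(nodes) do
    strip_blank? = Keyword.get(opts, :strip_blank_text, false)
    strip_comments? = Keyword.get(opts, :strip_comments, true)

    Enum.flat_map(nodes, fn
      # Text node: drop it only if it is whitespace-only and the option is set.
      text when is_binary(text) ->
        if strip_blank? and String.trim(text) == "", do: [], else: [text]

      # Comment node: dropped by default.
      {:comment, _} = comment ->
        if strip_comments?, do: [], else: [comment]

      # Element: lower-case the tag and attribute names, then recurse.
      {tag, attrs, children} ->
        attrs = Enum.map(attrs, fn {name, value} -> {String.downcase(name), value} end)
        [{String.downcase(tag), attrs, clean(children, opts)}]
    end)
  end
end

tree = [{"UL", [{"CLASS", "menu"}], ["\n", {:comment, "todo"}, {"li", [], ["foo"]}]}]
IO.inspect(TreeOptions.clean(tree, strip_blank_text: true))
```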

Optional features:

  • [ ] method to collect styles for node (with priority, source file, line ...)
  • [ ] method to collect events for node
  • [ ] extra JQuery selectors, see docs
  • [ ] CSS validator with warnings/errors
<div style='fontt-color: white;'></div>

Eiji7 avatar Dec 20 '16 16:12 Eiji7

Yeah, XPath would be awesome, especially when scraping data from a website. Chrome can automatically generate XPath paths for you to specifically grab tags which would save me a lot of pattern matching...

As for html5ever, check out https://github.com/hansihe/Rustler

ghost avatar Jan 14 '17 23:01 ghost

@mhsjlw I agree. Please follow this issue for more details: https://github.com/philss/floki/issues/94 (sorry for the delay 😅 ).

philss avatar Mar 14 '17 04:03 philss

@gmile I totally forgot to update you, but it is now possible to use Servo's HTML parser with Floki!

Please follow these instructions: https://github.com/philss/floki#optional---using-http5ever-as-the-html-parser

philss avatar Mar 14 '17 04:03 philss

@philss wow, that's awesome! Thanks!

gmile avatar Mar 14 '17 10:03 gmile

Rust NIFs anyone?

https://github.com/servo/html5ever

;)

liveresume avatar Mar 21 '17 20:03 liveresume

@liveresume this was mentioned, twice, see https://github.com/philss/floki/issues/37#issuecomment-272662395 and https://github.com/philss/floki/issues/37#issuecomment-286318944

ghost avatar Mar 22 '17 00:03 ghost

Please have a look at: https://github.com/Overbryd/myhtmlex

Based on Alexander Borisov’s myhtml, this binding gains the properties of being html-spec compliant and very fast. https://github.com/lexborisov/myhtml

@Overbryd gave a talk about it in Berlin. I would love to see this come together!

f34nk avatar Feb 21 '18 15:02 f34nk

@f34nk Happy to help on this one.

I also wrote https://github.com/Overbryd/nodex, which can be used to provide a safe execution (C-)node, to get the best of both performance and safety.

I would refrain from using myhtmlex widely as a NIF without explicitly checking the crash-safety requirements of the application requiring it. So maybe providing two modes of operation (NIF and C-Node) might be the best way to go for a widely used package.

Overbryd avatar Feb 21 '18 15:02 Overbryd

I didn't know we had bindings for myhtml. That's great! Thank you for the work on that, @Overbryd!

We could for sure write an adapter like we did for html5ever parser. I don't know yet how we would enable the configuration of a C-Node, or if this is needed for the adapter. We can elaborate more ideas on that.
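
As a rough illustration of the adapter idea, a parser could be hidden behind a small behaviour so that callers do not care which backend is used. Everything below is a sketch under assumptions: `Sketch.HTMLParser` and its `Dummy` implementation are hypothetical names, not Floki's actual adapter modules, and the dummy parser only wraps the input in a single text node.

```elixir
defmodule Sketch.HTMLParser do
  # Hypothetical behaviour that every parser adapter would implement.
  @callback parse(html :: binary()) :: {:ok, term()} | {:error, term()}

  # Delegates to the given adapter module; the default is the stand-in below.
  def parse(html, parser \\ Sketch.HTMLParser.Dummy), do: parser.parse(html)
end

defmodule Sketch.HTMLParser.Dummy do
  @behaviour Sketch.HTMLParser

  # Stand-in implementation: returns the input as one text node.
  # A real adapter would call its backend here, e.g. :mochiweb_html.parse/1.
  @impl true
  def parse(html) when is_binary(html), do: {:ok, [html]}
end

IO.inspect(Sketch.HTMLParser.parse("<p>hi</p>"))
```

With this shape, the choice between a NIF and a C-Node could live inside a single adapter, or be exposed as two separate adapter modules.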

Thank you for letting us know, @f34nk! Can you open a new issue with the proposal?

philss avatar Feb 22 '18 01:02 philss