html-encoding-sniffer icon indicating copy to clipboard operation
html-encoding-sniffer copied to clipboard

Revamped package

Open fb55 opened this issue 4 years ago • 5 comments

Hi @domenic et al,

I was looking into using this package for cheerio, but had several issues that I wanted to fix first. As extending the current html-encoding-sniffer package turned out to be cumbersome, I opted to write a new module instead:

https://github.com/fb55/encoding-sniffer

This new package implements the current version of the encoding sniffing algo as a state machine. That allows streams to be supported without much effort. Features this supports, which aren't present in html-encoding-sniffer:

  • XML encoding types (UTF-16 prefixes and <?xml encoding="...">)
  • Configurable sniff depth (see #10)
  • x-user-defined in <meta> tags (turns out html-encoding-sniffer's support is broken)
  • Streams / sniffing of incomplete documents

I would love to join forces and have a single package that both jsdom and cheerio can use going forward. Let me know if this is something you'd be interested in!

fb55 avatar Nov 01 '21 14:11 fb55

Interesting.

XML encoding types (UTF-16 prefixes and <?xml encoding="...">)

I believe these (at least the latter; not sure about the former) were only recently added to the HTML spec. I'd love to add support for them to html-encoding-sniffer, following the spec at https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding

Configurable sniff depth (see #10)

Seems reasonable, although I'm not clear on the use case, since AFAIK all browsers use 1024.

x-user-defined in <meta> tags (turns out html-encoding-sniffer's support is broken)

This is not broken; it's following the spec: https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding step 4.case2.15.

Streams / sniffing of incomplete documents

I'm hesitant to add the complexity of streams (especially Node streams) to this package, so this would require some extra discussion as to how we could layer the API into a non-streaming version and a streaming version.

domenic avatar Nov 01 '21 14:11 domenic

x-user-defined in <meta> tags (turns out html-encoding-sniffer's support is broken)

This is not broken; it's following the spec: html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding step 4.case2.15.

whatwgEncoding.labelToName doesn't return a value when passing x-user-defined. So the branch will never be hit.

fb55 avatar Nov 01 '21 14:11 fb55

To be clearer about what I'm proposing:

I would like https://github.com/fb55/encoding-sniffer to become [email protected]. This will include API compatibility & all existing tests will be added to the new project. Repo ownership TBD, it can live in the JSDOM org.

fb55 avatar Nov 01 '21 15:11 fb55

whatwgEncoding.labelToName doesn't return a value when passing x-user-defined. So the branch will never be hit.

Ah, I see, thanks.

I would like https://github.com/fb55/encoding-sniffer to become [email protected].

Oh, sorry, no, I am not interested in that. I thought you were offering to improve html-encoding-sniffer.

domenic avatar Nov 01 '21 15:11 domenic

Oh, sorry, no, I am not interested in that. I thought you were offering to improve html-encoding-sniffer.

No worries, thanks for the reply. Just in case the semantic difference matters to you: I am also happy to open a PR that adds the state machine of fb55/encoding-sniffer to this package, and archive my repo.

The original wording was supposed to cover both, and I honestly just want to have a package that covers the use-cases I have for cheerio.

fb55 avatar Nov 01 '21 15:11 fb55