Parsing non-UTF-8 pages

Open edevil opened this issue 7 years ago • 3 comments

Parsing pages not written in UTF-8 currently produces errors:

> %HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
> Html5ever.parse(body)

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }', src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
{:error, "called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }"}

In this case, the XML feed declares its encoding in the XML preamble:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Can I get around this problem or can the library be fixed to handle this situation?

edevil avatar May 31 '17 15:05 edevil

I'll leave the broader question of "can the library be fixed to handle this situation?" to Hans, but-

Can I get around this problem

Yeah, for some definition of "get around".

body
|> Codepagex.to_string!(:iso_8859_1)
|> Html5ever.parse()
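
If pulling in Codepagex isn't an option, the same idea can be sketched with only the standard library. This is a rough sketch, not anything Html5ever provides: the `EncodingSketch` module and its functions are hypothetical names, and it assumes the only non-UTF-8 encoding you need to handle is ISO-8859-1, as declared in the XML preamble.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper, not part of Html5ever.
  # Returns {:ok, utf8_binary} or {:error, :unknown_encoding}.
  def to_utf8(body) when is_binary(body) do
    cond do
      # Already valid UTF-8: pass it through untouched.
      String.valid?(body) ->
        {:ok, body}

      # Assumption: only Latin-1 is handled here.
      declared_encoding(body) in ["iso-8859-1", "latin1"] ->
        {:ok, :unicode.characters_to_binary(body, :latin1, :utf8)}

      true ->
        {:error, :unknown_encoding}
    end
  end

  # Reads the encoding from an XML declaration at the very start of the body.
  # The regex has no `u` modifier, so it matches on raw bytes and does not
  # choke on input that is not valid UTF-8.
  def declared_encoding(body) do
    case Regex.run(~r/^<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']/, body) do
      [_, encoding] -> String.downcase(encoding)
      nil -> nil
    end
  end
end
```

With that in place, something like `{:ok, utf8} = EncodingSketch.to_utf8(body)` followed by `Html5ever.parse(utf8)` should avoid the `Utf8Error` panic for feeds like the one above. It does no real encoding detection, so pages that don't declare their encoding still fall through to the error case.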

mischov avatar May 31 '17 17:05 mischov

Thanks, @mischov!

edevil avatar Jun 01 '17 11:06 edevil

Going to keep this open, I would still like to find a proper solution for this.

As far as I can tell, html5ever does not support detecting encoding yet. See this issue.

hansihe avatar Jun 01 '17 11:06 hansihe