readability.rs
Does this work?
Hey, I just stumbled upon this repo, and it seems that you have ported the well-known readability algorithm to Rust, using kuchiki and therefore html5ever. First of all: truly great!
However, the algorithm seems to crash when run on real-world HTML pages. I get panics like:
1: 0x10a9bc24c - std::sys::imp::backtrace::tracing::imp::write::hf587afb8e94ad165
2: 0x10a9be23e - std::panicking::default_hook::{{closure}}::haf3443cb412055ce
3: 0x10a9bdde3 - std::panicking::default_hook::h742f925bfab3bbfa
4: 0x10a9be6f7 - std::panicking::rust_panic_with_hook::h6f06ff8d28a94df6
5: 0x10a9be5a4 - std::panicking::begin_panic::h7b9167ba3324cfae
6: 0x10a9be4c2 - std::panicking::begin_panic_fmt::hb5f8f1fe0fe23e28
7: 0x10a9be427 - rust_begin_unwind
8: 0x10a9e5e60 - core::panicking::panic_fmt::he6eb92dab4407c61
9: 0x10a9e5eed - core::option::expect_failed::hf8bba00a6e833438
10: 0x10a70f373 - <core::option::Option<T>>::expect::hba43ec4f65591df2
11: 0x10a6cf697 - <std::collections::hash::map::HashMap<K, V, S> as core::ops::Index<&'a Q>>::index::he1febf3b2b851612
12: 0x10a782795 - readability::Readability::add_info::h3257b725054a9642
13: 0x10a782026 - readability::Readability::readify::h110ae48756961de8
14: 0x10a781a7a - readability::Readability::parse::h69c7871f90548046
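Reading frames 9 to 12, the panic seems to come from indexing a `HashMap` with a key that is not present, somewhere inside `add_info`. I have not traced the crate's code, so the following is only a minimal sketch of that failure mode and two non-panicking alternatives. `NodeInfo`, the `u32` node ids, and all function names are hypothetical and not taken from the crate.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for per-node bookkeeping; not the crate's real type.
#[derive(Debug, Default, PartialEq)]
struct NodeInfo {
    score: f32,
}

/// Panicking variant: `infos[&node_id]` goes through the `Index` impl,
/// which panics when the key is absent (as in the backtrace above).
fn score_or_panic(infos: &HashMap<u32, NodeInfo>, node_id: u32) -> f32 {
    infos[&node_id].score
}

/// Non-panicking lookup: `get` returns `None` for a missing key,
/// so the caller decides how to handle an unknown node.
fn score_checked(infos: &HashMap<u32, NodeInfo>, node_id: u32) -> Option<f32> {
    infos.get(&node_id).map(|info| info.score)
}

/// Insert-on-demand: `entry(..).or_default()` creates a default record
/// the first time a node is seen, then updates it in place.
fn add_score(infos: &mut HashMap<u32, NodeInfo>, node_id: u32, delta: f32) {
    infos.entry(node_id).or_default().score += delta;
}
```

If the real cause is a node that was never registered before being scored, something along the lines of `add_score` would avoid the panic, but that is a guess from the backtrace alone.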
Maybe this repo also needs some small polish, such as publishing on crates.io and a README with a short "how to use". I just figured out that
readability::new().parse(&html_string).text_contents()
works more or less to get started, but I had tinkered with kuchiki before. Do you want some help? I might not be of much use on the algorithmic side in Rust yet, but once you have a working state of this crate, I'd like to write some docs for you in exchange. What do you think?
Hi, thank you for your interest! I plan to return to the project next month (I need it for my degree work). I will need to port Mozilla's tests and some heuristics to improve precision. It would also be good to abstract the library over any DOM, not only kuchiki.
... actual HTML websites, I get panics like
Can you provide me with a web page that you used when you got this error?