
Import HTML pages

Open joepio opened this issue 3 years ago • 4 comments

Basically a Pocket-like bookmarking tool that saves HTML pages in a friendly way. If Atomic Server persists this, we immediately get search and edit features. Quite useful!

So how to realize this?

It's probably a good idea to simply emit JSON-AD, which can be parsed by the importer (#390).

There's a good chance it won't even be part of this repo.

In the browser

  • Lots of JS tools can parse HTML (e.g. Readability)
  • CORS will be difficult if we run this in the browser

Using html2md / Rust

We can use the Content-Type header to check whether we get a JSON or an HTML page back. We'd need some sort of Rust library for parsing the HTML and converting it into Markdown or something. Paperoni is a CLI that turns HTML into epub, and html2md converts HTML to Markdown. Or perhaps use the underlying html5ever crate.
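The Content-Type branching could be kept in a small pure function, which makes it easy to test without a network. A minimal sketch (the `ImportKind` enum and `classify` helper are hypothetical names, not existing Atomic Server APIs; `application/ad+json` is the JSON-AD media type):

```rust
// Hypothetical helper: decide how to treat a fetched resource based on the
// value of its Content-Type header.
#[derive(Debug, PartialEq)]
enum ImportKind {
    JsonAd,
    Html,
    Unsupported,
}

fn classify(content_type: &str) -> ImportKind {
    // Content-Type may carry parameters, e.g. "text/html; charset=utf-8",
    // so take only the media type before the first ';'.
    let mime = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match mime.as_str() {
        "application/json" | "application/ad+json" => ImportKind::JsonAd,
        "text/html" | "application/xhtml+xml" => ImportKind::Html,
        _ => ImportKind::Unsupported,
    }
}

fn main() {
    assert_eq!(classify("text/html; charset=utf-8"), ImportKind::Html);
    assert_eq!(classify("application/ad+json"), ImportKind::JsonAd);
    assert_eq!(classify("image/png"), ImportKind::Unsupported);
    println!("ok");
}
```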

@Polleps will work on this. @AlexMikhalev you might want to share some thoughts on this, too!

I tried this:

// Requires the `reqwest` crate (with the "blocking" feature) and `html2md`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the article and convert the entire HTML body to Markdown.
    let res = reqwest::blocking::get("https://nos.nl/artikel/2432204-tien-keer-levenslang-geeist-voor-aanslagen-parijs-in-2015")?;
    let md = html2md::parse_html(&res.text()?);
    println!("{md}");
    Ok(())
}

But got pretty bad results: all the nav items were included, too. I created an issue for a reader feature request.

Or you could parse the HTML (with html5ever), select the main or article element, and pass that to html2md.

joepio avatar Jun 10 '22 13:06 joepio

A common import tool with Content-Type detection sounds good. I think we need to parse into an internal Atomic Data server structure, so that both HTML imports and Markdown imports create Atomic Data "documents", which are then rendered back to users at runtime, for example using https://github.com/raphlinus/pulldown-cmark if we need Markdown output (cargo doc uses it). For HTML parsing I see a lot of examples in crawlers using scraper and Selector: https://hackernoon.com/parsing-html-with-rust-a-simple-tutorial-using-tokio-reqwest-and-scraper and https://kerkour.com/rust-crawler-scraping-and-parsing-html , as well as this crate: https://github.com/utkarshkukreti/select.rs. I haven't tried them yet.

AlexMikhalev avatar Jun 11 '22 08:06 AlexMikhalev

@joepio how will the data model/struct look for the bookmark?

AlexMikhalev avatar Jun 11 '22 08:06 AlexMikhalev

To clarify the comments above: since we are writing a specific importer, we can re-use the pattern from crawlers and select the importer via a command-line parameter, e.g. `Atomic-data-cli import bookmarks source Firefox|Chrome`. The importer should check whether the bookmarks data type is configured in Atomic Server and then import the HTML using the select crate; we need specific content mapped to specific fields in the bookmarks data type. By the way, why are we parsing HTML for bookmarks? AFAIK it's XPath-based XHTML.
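The importer-selection part of that CLI could start as a plain argument match. This is purely hypothetical (no such tool exists yet; command shape and positions are illustrative):

```rust
// Hypothetical dispatch for `atomic-data-cli import bookmarks source <browser>`.
#[derive(Debug, PartialEq)]
enum BookmarkSource {
    Firefox,
    Chrome,
}

fn parse_source(arg: &str) -> Option<BookmarkSource> {
    match arg.to_ascii_lowercase().as_str() {
        "firefox" => Some(BookmarkSource::Firefox),
        "chrome" => Some(BookmarkSource::Chrome),
        _ => None,
    }
}

fn main() {
    // args: [program, "import", "bookmarks", "source", "Firefox"]
    let args: Vec<String> = std::env::args().collect();
    let source = args.get(4).and_then(|s| parse_source(s));
    println!("selected importer: {source:?}");
}
```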

AlexMikhalev avatar Jun 11 '22 10:06 AlexMikhalev

After spending some time thinking about it: let's make sure we separate bookmark import/sync from storing and taking snapshots. Taking snapshots can be done after we have basic bookmarks, as a variation that runs a browser via Selenium and dumps the output. @joepio do we have a ticket for bookmark import? There is a detailed specification for bookmarks from Mozilla.

AlexMikhalev avatar Jun 12 '22 20:06 AlexMikhalev