atomic-server
                        Import HTML pages
Basically a Pocket-like bookmarking tool that saves HTML pages in a friendly way. If Atomic Server persists these pages, we immediately get search and edit features. Quite useful!
So how do we realize this?
It's probably a good idea to simply emit JSON-AD, which can be parsed by the importer (#390).
There's a good chance it won't even be part of this repo.
In the browser
- Lots of JS tools can parse HTML (e.g. Readability)
- CORS will be difficult if we run this in the browser
Using html2md / rust
We can use the Content-Type header to check whether we get a JSON or an HTML page back. We'd need some sort of Rust library for parsing the HTML and converting it into Markdown or similar. Paperoni is a CLI that turns HTML into EPUB; html2md converts HTML to Markdown. Or perhaps use the underlying html5ever crate.
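The Content-Type check itself is just string matching on the header value. A minimal sketch (the `FetchedKind` enum and function name are made up for illustration; in practice the header would come from `reqwest`'s response headers, and JSON-AD is served as `application/ad+json`):

```rust
/// Hypothetical classification of a fetched response by its Content-Type
/// header, to decide between the JSON-AD and HTML import paths.
#[derive(Debug, PartialEq)]
enum FetchedKind {
    JsonAd,
    Html,
    Unknown,
}

fn classify_content_type(content_type: &str) -> FetchedKind {
    // Strip parameters like `; charset=utf-8` and normalize case.
    let mime = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match mime.as_str() {
        "application/ad+json" | "application/json" => FetchedKind::JsonAd,
        "text/html" | "application/xhtml+xml" => FetchedKind::Html,
        _ => FetchedKind::Unknown,
    }
}

fn main() {
    assert_eq!(
        classify_content_type("text/html; charset=utf-8"),
        FetchedKind::Html
    );
    assert_eq!(classify_content_type("application/ad+json"), FetchedKind::JsonAd);
    println!("ok");
}
```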
@Polleps will work on this. @AlexMikhalev you might want to share some thoughts on this, too!
I tried this:

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the article over HTTP (blocking, fine for a quick experiment).
    let res = reqwest::blocking::get("https://nos.nl/artikel/2432204-tien-keer-levenslang-geeist-voor-aanslagen-parijs-in-2015")?;
    // Convert the full HTML body to Markdown.
    let md = html2md::parse_html(&res.text()?);
    println!("{md}");
    Ok(())
}
```
But I got pretty bad results: all the nav items were included, too. I created an issue for a reader feature request.
Or you could parse the HTML (with html5ever), select the main or article element, and pass that to html2md.
A common import tool with Content-Type detection sounds good. I think we need to parse into an internal Atomic Data server structure, so that both HTML and Markdown imports create Atomic Data "documents", which are then rendered back to users at runtime, for example using https://github.com/raphlinus/pulldown-cmark if we need to output Markdown (cargo doc uses it). For HTML parsing I see a lot of examples in crawlers using scraper and selectors: https://hackernoon.com/parsing-html-with-rust-a-simple-tutorial-using-tokio-reqwest-and-scraper and https://kerkour.com/rust-crawler-scraping-and-parsing-html , or this crate: https://github.com/utkarshkukreti/select.rs. I haven't tried them yet.
@joepio how will the data model/struct look for the bookmark?
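As a starting point for that discussion, here is one possible shape for the struct. Every field name is a guess, not a settled data model; in Atomic Server these would presumably map to properties on a Bookmark class:

```rust
/// Hypothetical bookmark resource; all field names are assumptions
/// about what an eventual Atomic Data "Bookmark" class could contain.
#[derive(Debug)]
struct Bookmark {
    /// Original address of the page.
    url: String,
    /// <title> of the fetched document.
    title: String,
    /// Main content, converted to Markdown (or Atomic "document" elements).
    content_markdown: String,
    /// When the snapshot was taken (Unix timestamp, seconds).
    fetched_at: u64,
}

impl Bookmark {
    fn new(url: &str, title: &str, content_markdown: &str, fetched_at: u64) -> Self {
        Bookmark {
            url: url.to_string(),
            title: title.to_string(),
            content_markdown: content_markdown.to_string(),
            fetched_at,
        }
    }
}

fn main() {
    let b = Bookmark::new("https://example.com", "Example", "# Example\n", 0);
    println!("{} ({})", b.title, b.url);
}
```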
To clarify the comments above: we are writing a specific importer, and we can reuse the pattern from crawlers where the importer is selected via a command-line parameter, e.g. `atomic-data-cli import bookmarks --source firefox|chrome`. The importer shall check whether the bookmarks data type is configured in Atomic Server and then import the HTML using the select crate; we need specific content mapped to specific fields in the bookmarks data type. By the way, why are we parsing HTML for bookmarks? AFAIK it's XPath-based XHTML.
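The source-selection part of that CLI idea could be sketched like this (the command shape and the `BookmarkSource` enum are assumptions; atomic-server's actual CLI may look different):

```rust
/// Hypothetical importer source chosen from a CLI argument, as in
/// `atomic-data-cli import bookmarks --source firefox`.
#[derive(Debug, PartialEq)]
enum BookmarkSource {
    Firefox,
    Chrome,
}

fn parse_source(arg: &str) -> Result<BookmarkSource, String> {
    // Accept any casing from the command line.
    match arg.to_ascii_lowercase().as_str() {
        "firefox" => Ok(BookmarkSource::Firefox),
        "chrome" => Ok(BookmarkSource::Chrome),
        other => Err(format!("unknown bookmark source: {other}")),
    }
}

fn main() {
    assert_eq!(parse_source("Firefox"), Ok(BookmarkSource::Firefox));
    assert!(parse_source("safari").is_err());
    println!("ok");
}
```

Each supported browser would then get its own import routine behind this enum, keeping the content mapping per source in one place.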
After spending some time thinking about it: let's make sure we separate bookmark import/sync from storing and taking snapshots. Taking snapshots can be done once we have basic bookmarks, with a variation that runs a browser via Selenium and dumps the output. @joepio do we have a ticket for bookmark import? The detailed specification for bookmarks: mozilla