html2data
                                
Library and CLI utility for extracting data from HTML via CSS selectors
Install
Install package and command line utility:
go install github.com/msoap/html2data/cmd/html2data@latest
Install package only:
go get -u github.com/msoap/html2data
Methods
- FromReader(io.Reader) - create a document for parsing from a reader
- FromURL(URL, [config URLCfg]) - create a document from an http(s) URL
- FromFile(file) - create a document from a local file
- doc.GetData(css map[string]string) - get texts by CSS selectors
- doc.GetDataFirst(css map[string]string) - get texts by CSS selectors, keeping only the first entry for each selector, or "" if nothing matched
- doc.GetDataNested(outerCss string, css map[string]string) - extract nested data by CSS selectors inside each element matched by the outer CSS selector
- doc.GetDataNestedFirst(outerCss string, css map[string]string) - same as GetDataNested, keeping only the first entry for each selector, or "" if nothing matched
- doc.GetDataSingle(css string) - get one result by one CSS selector
Or with a config:
- doc.GetData(css map[string]string, html2data.Cfg{DontTrimSpaces: true})
- doc.GetDataNested(outerCss string, css map[string]string, html2data.Cfg{DontTrimSpaces: true})
- doc.GetDataSingle(css string, html2data.Cfg{DontTrimSpaces: true})
Pseudo-selectors
- :attr(attr_name) - get an attribute instead of text, for example a:attr(href) to get URLs from links
- :html - get HTML instead of text
- :get(N) - get the N-th element from the list
Example
package main
import (
    "fmt"
    "log"
    "github.com/msoap/html2data"
)
func main() {
    doc := html2data.FromURL("http://example.com")
    // or with config
    // doc := html2data.FromURL("http://example.com", html2data.URLCfg{UA: "userAgent", TimeOut: 10, DontDetectCharset: false})
    if doc.Err != nil {
        log.Fatal(doc.Err)
    }
    // get title
    title, err := doc.GetDataSingle("title")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Title is:", title)
    title, err = doc.GetDataSingle("title", html2data.Cfg{DontTrimSpaces: true})
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Title as is, with spaces:", title)
    texts, err := doc.GetData(map[string]string{"h1": "h1", "links": "a:attr(href)"})
    if err != nil {
        log.Fatal(err)
    }
    // get all H1 headers:
    if textOne, ok := texts["h1"]; ok {
        for _, text := range textOne {
            fmt.Println(text)
        }
    }
    // get all urls from links
    if links, ok := texts["links"]; ok {
        for _, text := range links {
            fmt.Println(text)
        }
    }
}
Command line utility
Usage
html2data [options] URL "css selector"
html2data [options] URL :name1 "css1" :name2 "css2"...
html2data [options] file.html "css selector"
cat file.html | html2data "css selector"
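The :name "css" pairs in the second form line up naturally with the map[string]string that doc.GetData accepts. The sketch below shows one way such arguments could be turned into that map. parseSelectorArgs and the fallback name "one" are hypothetical and are not taken from the utility's source.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelectorArgs is a hypothetical helper, not the utility's actual code.
// It turns arguments such as [":title", "h1", ":links", "a:attr(href)"]
// into a name -> selector map. A single bare selector is stored under the
// made-up name "one".
func parseSelectorArgs(args []string) (map[string]string, error) {
	selectors := make(map[string]string)
	if len(args) == 1 && !strings.HasPrefix(args[0], ":") {
		selectors["one"] = args[0]
		return selectors, nil
	}
	if len(args) == 0 || len(args)%2 != 0 {
		return nil, fmt.Errorf("expected pairs of :name and selector, got %d arguments", len(args))
	}
	for i := 0; i < len(args); i += 2 {
		name, selector := args[i], args[i+1]
		if !strings.HasPrefix(name, ":") || len(name) < 2 {
			return nil, fmt.Errorf("argument %q is not of the form :name", name)
		}
		selectors[strings.TrimPrefix(name, ":")] = selector
	}
	return selectors, nil
}

func main() {
	selectors, err := parseSelectorArgs([]string{":title", "h1", ":links", "a:attr(href)"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(selectors["title"], selectors["links"])
}
```

The resulting map can be passed to doc.GetData unchanged, which gives named results in the same shape as the library example above.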
Options
- -user-agent="Custom UA" -- set a custom user-agent
- -find-in="outer.css.selector" -- search within the specified elements instead of the whole document
- -json -- output the result as JSON
- -dont-trim-spaces -- get text as is, without trimming spaces
- -dont-detect-charset -- don't detect the charset and convert the text
- -timeout=10 -- set the timeout for loading the URL
Install
Download binaries from: releases (macOS/Linux/Windows/RaspberryPi)
Or install from Homebrew (macOS):
brew tap msoap/tools
brew install html2data
# update:
brew upgrade html2data
Using snap (Ubuntu or any Linux distribution with snap):
# install stable version:
sudo snap install html2data
# install the latest version:
sudo snap install --edge html2data
# update
sudo snap refresh html2data
From source:
go install github.com/msoap/html2data/cmd/html2data@latest
Examples
Get title of page:
html2data https://go.dev/ title
Last blog posts:
html2data https://go.dev/blog/ 'div#blogindex p.blogtitle a'
Getting RSS URL:
html2data https://go.dev/blog/ 'link[type="application/atom+xml"]:attr(href)'
More examples are available in the wiki.