hast
Hypertext Abstract Syntax Tree format.
hast is a specification for representing HTML (and embedded SVG or MathML) as an abstract syntax tree. It implements the unist spec.
This document may not be released.
See releases for released documents.
The latest released version is 2.4.0.
Contents
- Introduction
  - Where this specification fits
  - Virtual DOM
- Nodes
  - Parent
  - Literal
  - Root
  - Element
  - Doctype
  - Comment
  - Text
- Glossary
- List of utilities
- Related HTML utilities
- References
- Security
- Related
- Contribute
- Acknowledgments
- License
Introduction
This document defines a format for representing hypertext as an abstract syntax tree. Development of hast started in April 2016 for rehype. This specification is written in a Web IDL-like grammar.
Where this specification fits
hast extends unist, a format for syntax trees, to benefit from its ecosystem of utilities.
hast relates to JavaScript in that it has an ecosystem of utilities for working with compliant syntax trees in JavaScript. However, hast is not limited to JavaScript and can be used in other programming languages.
hast relates to the unified and rehype projects in that hast syntax trees are used throughout their ecosystems.
Virtual DOM
The reasons for introducing a new “virtual” DOM are primarily:
- The DOM is very heavy to implement outside of the browser, whereas a lean and stripped down virtual DOM can be used everywhere
- Most virtual DOMs do not focus on ease of use in transformations
- Other virtual DOMs cannot represent the syntax of HTML in its entirety (think comments and document types)
- Neither the DOM nor virtual DOMs focus on positional information
Nodes
Parent
interface Parent <: UnistParent {
  children: [Element | Doctype | Comment | Text]
}
Parent (UnistParent) represents a node in hast containing other nodes (said to be children).
Its content is limited to only other hast content.
Literal
interface Literal <: UnistLiteral {
  value: string
}
Literal (UnistLiteral) represents a node in hast containing a value.
Root
interface Root <: Parent {
  type: "root"
}
Root (Parent) represents a document.
Root can be used as the root of a tree, or as a value of the content field on a 'template' Element, never as a child.
Element
interface Element <: Parent {
  type: "element"
  tagName: string
  properties: Properties?
  content: Root?
  children: [Element | Comment | Text]
}
Element (Parent) represents an Element ([DOM]).
A tagName field must be present.
It represents the element’s local name ([DOM]).
The properties field represents information associated with the element.
The value of the properties field implements the Properties interface.
If the tagName field is 'template', a content field can be present.
The value of the content field implements the Root interface.
If the tagName field is 'template', the element must be a leaf.
If the tagName field is 'noscript', its children should be represented as if scripting is disabled ([HTML]).
For example, the following HTML:
<a href="https://alpha.com" class="bravo" download></a>
Yields:
{
  type: 'element',
  tagName: 'a',
  properties: {
    href: 'https://alpha.com',
    className: ['bravo'],
    download: true
  },
  children: []
}
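The rules for 'template' elements given earlier in this section can be illustrated with a small sketch. The function name `followsTemplateRules` is hypothetical and not part of hast or its utilities, and the check assumes that a `content` field is only meaningful on 'template' elements.

```javascript
// `<template><p>Hi</p></template>`: `children` stays empty (the element is a
// leaf) and the markup lives in `content`, which is a Root.
const template = {
  type: 'element',
  tagName: 'template',
  properties: {},
  children: [],
  content: {
    type: 'root',
    children: [
      {
        type: 'element',
        tagName: 'p',
        properties: {},
        children: [{type: 'text', value: 'Hi'}]
      }
    ]
  }
}

// Hypothetical check (an illustration, not an official validator).
function followsTemplateRules(element) {
  const hasContent = element.content !== undefined && element.content !== null
  // Assumption: `content` is only allowed when `tagName` is 'template'.
  if (element.tagName !== 'template') return !hasContent
  // A 'template' element must be a leaf.
  if (element.children.length > 0) return false
  // When present, `content` must implement the Root interface.
  return !hasContent || element.content.type === 'root'
}

console.log(followsTemplateRules(template)) // prints: true
```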
Properties
interface Properties {}
Properties represents information associated with an element.
Every field must be a PropertyName and every value a PropertyValue.
PropertyName
typedef string PropertyName
Property names are keys on Properties objects and reflect HTML, SVG, ARIA, XML, XMLNS, or XLink attribute names.
Often, they have the same value as the corresponding attribute (for example, id is a property name reflecting the id attribute name), but there are some notable differences.
These rules aren’t simple. Use hastscript (or property-information directly) to help.
The following rules are used to transform HTML attribute names to property names. These rules are based on how ARIA is reflected in the DOM ([ARIA]), and differ from how some (older) HTML attributes are reflected in the DOM.
- Any name referencing a combination of multiple words (such as “stroke miter limit”) becomes a camelcased property name capitalizing each word boundary. This includes combinations that are sometimes written as several words. For example, stroke-miterlimit becomes strokeMiterLimit, autocorrect becomes autoCorrect, and allowfullscreen becomes allowFullScreen.
- Any name that can be hyphenated becomes a camelcased property name capitalizing each boundary. For example, “read-only” becomes readOnly.
- Compound words that are not used with spaces or hyphens are treated as a normal word and the previous rules apply. For example, “placeholder”, “strikethrough”, and “playback” stay the same.
- Acronyms in names are treated as a normal word and the previous rules apply. For example, itemid becomes itemId and bgcolor becomes bgColor.
Exceptions
Some jargon is seen as one word even though it may not be seen as such by dictionaries.
For example, nohref becomes noHref, playsinline becomes playsInline, and accept-charset becomes acceptCharset.
The HTML attributes class and for respectively become className and htmlFor in alignment with the DOM.
No other attributes gain different names as properties, other than a change in casing.
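Because word boundaries cannot be derived from the attribute name alone, a lookup table is one way to apply the rules and exceptions above. The following is a minimal sketch: the table `knownPropertyNames` and the function `toPropertyName` are hypothetical names, the table only holds the examples given in this section, and a real implementation would need a complete list of names.

```javascript
// Hand-picked entries taken from the examples in this section.
// This is far from complete; it only illustrates the idea.
const knownPropertyNames = {
  'stroke-miterlimit': 'strokeMiterLimit',
  autocorrect: 'autoCorrect',
  allowfullscreen: 'allowFullScreen',
  readonly: 'readOnly',
  itemid: 'itemId',
  bgcolor: 'bgColor',
  nohref: 'noHref',
  playsinline: 'playsInline',
  'accept-charset': 'acceptCharset',
  class: 'className',
  for: 'htmlFor'
}

// Hypothetical helper: map an HTML attribute name to a property name.
// Assumption: HTML attribute names are matched case-insensitively, and names
// missing from the table are returned lowercased and otherwise unchanged.
function toPropertyName(attributeName) {
  const key = attributeName.toLowerCase()
  return Object.prototype.hasOwnProperty.call(knownPropertyNames, key)
    ? knownPropertyNames[key]
    : key
}

console.log(toPropertyName('class')) // prints: className
console.log(toPropertyName('ACCEPT-CHARSET')) // prints: acceptCharset
console.log(toPropertyName('id')) // prints: id
```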
Notes
property-information lists all property names.
The property name rules differ from how HTML is reflected in the DOM for the following attributes:
- charoff becomes charOff (not chOff)
- char stays char (does not become ch)
- rel stays rel (does not become relList)
- checked stays checked (does not become defaultChecked)
- muted stays muted (does not become defaultMuted)
- value stays value (does not become defaultValue)
- selected stays selected (does not become defaultSelected)
- allowfullscreen becomes allowFullScreen (not allowFullscreen)
- hreflang becomes hrefLang, not hreflang
- autoplay becomes autoPlay, not autoplay
- autocomplete becomes autoComplete (not autocomplete)
- autofocus becomes autoFocus, not autofocus
- enctype becomes encType, not enctype
- formenctype becomes formEncType (not formEnctype)
- vspace becomes vSpace, not vspace
- hspace becomes hSpace, not hspace
- lowsrc becomes lowSrc, not lowsrc
PropertyValue
typedef any PropertyValue
Property values should reflect the data type determined by their property name.
For example, the HTML <div hidden></div> has a hidden attribute, which is reflected as a hidden property name set to the property value true, and <input minlength="5">, which has a minlength attribute, is reflected as a minLength property name set to the property value 5.
In JSON, the value null must be treated as if the property was not included. In JavaScript, both null and undefined must be similarly ignored.
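As a rough illustration of that rule, a consumer could drop such properties before using them. The helper name `definedProperties` is hypothetical and not part of any hast utility.

```javascript
// Hypothetical helper: return only the properties that count as "included".
function definedProperties(properties) {
  const result = {}
  // `properties` is optional on elements, so fall back to an empty object.
  for (const [name, value] of Object.entries(properties || {})) {
    // `null` and `undefined` are treated as if the property was not included.
    if (value === null || value === undefined) continue
    result[name] = value
  }
  return result
}

const input = {
  type: 'element',
  tagName: 'input',
  properties: {minLength: 5, hidden: null, title: undefined, disabled: false},
  children: []
}

// `false` is an ordinary value and is kept; only `null` and `undefined` go.
console.log(Object.keys(definedProperties(input.properties))) // prints: [ 'minLength', 'disabled' ]
```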
The DOM has strict rules on how it coerces HTML to expected values, whereas hast is more lenient in how it reflects the source.
Where the DOM treats <div hidden="no"></div> as having a value of true and <img width="yes"> as having a value of 0, these should be reflected as 'no' and 'yes', respectively, in hast.
The reason for this is to allow plugins and utilities to inspect these non-standard values.
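One way to picture this leniency is a conversion that only coerces values it recognizes and otherwise keeps the source string. This is a sketch under assumptions: the function `reflectValue` and its `kind` parameter are hypothetical, and the rules for what counts as a valid boolean or number are simplified choices made for this example, not something this specification defines.

```javascript
// Hypothetical helper: turn an attribute value (always a string in HTML)
// into a property value, keeping non-standard values as they were written.
function reflectValue(kind, name, value) {
  if (kind === 'boolean') {
    // Assumption: an empty value or a value equal to the name means `true`.
    const isPresent = value === '' || value.toLowerCase() === name.toLowerCase()
    return isPresent ? true : value
  }

  if (kind === 'number') {
    // Assumption: only plain non-negative integers are coerced.
    return /^\d+$/.test(value) ? Number(value) : value
  }

  return value
}

console.log(reflectValue('boolean', 'hidden', '')) // prints: true
console.log(reflectValue('boolean', 'hidden', 'no')) // prints: no
console.log(reflectValue('number', 'minLength', '5')) // prints: 5
console.log(reflectValue('number', 'width', 'yes')) // prints: yes
```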
The DOM also specifies attribute values that are comma-separated or space-separated lists.
In hast, these should be treated as ordered lists.
For example, <div class="alpha bravo"></div> is represented as ['alpha', 'bravo'].
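A minimal sketch of how such attribute values could be split into ordered lists follows. The function names are hypothetical and edge cases are handled naively; the space-separated-tokens and comma-separated-tokens packages listed under related HTML utilities exist for this purpose.

```javascript
// Hypothetical helper: split a space-separated attribute value, keeping order.
function parseSpaceSeparated(value) {
  const trimmed = value.trim()
  return trimmed === '' ? [] : trimmed.split(/[ \t\n\r\f]+/)
}

// Hypothetical helper: split a comma-separated attribute value, keeping order.
function parseCommaSeparated(value) {
  if (value.trim() === '') return []
  return value.split(',').map((token) => token.trim())
}

console.log(parseSpaceSeparated('alpha bravo')) // prints: [ 'alpha', 'bravo' ]
console.log(parseCommaSeparated('image/png, image/jpeg')) // prints: [ 'image/png', 'image/jpeg' ]
```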
There’s no special format for the property value of the style property name.
Doctype
interface Doctype <: Node {
  type: "doctype"
}
Doctype (Node) represents a DocumentType ([DOM]).
For example, the following HTML:
<!doctype html>
Yields:
{type: 'doctype'}
Comment
interface Comment <: Literal {
  type: "comment"
}
Comment (Literal) represents a Comment ([DOM]).
For example, the following HTML:
<!--Charlie-->
Yields:
{type: 'comment', value: 'Charlie'}
Text
interface Text <: Literal {
  type: "text"
}
Text (Literal) represents a Text ([DOM]).
For example, the following HTML:
<span>Foxtrot</span>
Yields:
{
  type: 'element',
  tagName: 'span',
  properties: {},
  children: [{type: 'text', value: 'Foxtrot'}]
}
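To show how the node types fit together, here is a small sketch that walks a tree and collects the values of its text nodes. The helper `collectText` is hypothetical and written only for this example; hast-util-to-string in the list of utilities is the maintained tool for getting the plain-text value of a node.

```javascript
// A hast tree for `<p>Hello <b>world</b><!--note--></p>`.
const paragraphTree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'p',
      properties: {},
      children: [
        {type: 'text', value: 'Hello '},
        {
          type: 'element',
          tagName: 'b',
          properties: {},
          children: [{type: 'text', value: 'world'}]
        },
        {type: 'comment', value: 'note'}
      ]
    }
  ]
}

// Hypothetical helper: concatenate the values of all text nodes, depth first.
function collectText(node) {
  if (node.type === 'text') return node.value
  // Only parents (root and element) have `children`.
  if (Array.isArray(node.children)) {
    return node.children.map(collectText).join('')
  }
  // Comments and doctypes contribute nothing.
  return ''
}

console.log(collectText(paragraphTree)) // prints: Hello world
```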
Glossary
See the unist glossary.
List of utilities
See the unist list of utilities for more utilities.
- hastscript — create trees
- hast-to-hyperscript — transform to something else through a hyperscript DSL
- hast-util-assert — assert nodes
- hast-util-class-list — simulate the browser’s classList API for hast nodes
- hast-util-classnames — merge class names together
- hast-util-embedded — check if a node is an embedded element
- hast-util-excerpt — truncate the tree to a comment
- hast-util-find-and-replace — find and replace text in a tree
- hast-util-from-dom — transform from DOM tree
- hast-util-from-html — parse from HTML
- hast-util-from-parse5 — transform from Parse5’s AST
- hast-util-from-selector — parse CSS selectors to nodes
- hast-util-from-string — set the plain-text value of a node (textContent)
- hast-util-from-text — set the plain-text value of a node (innerText)
- hast-util-from-webparser — transform Webparser’s AST to hast
- hast-util-has-property — check if an element has a certain property
- hast-util-heading — check if a node is heading content
- hast-util-heading-rank — get the rank (also known as depth or level) of headings
- hast-util-interactive — check if a node is interactive
- hast-util-is-body-ok-link — check if a link element is “Body OK”
- hast-util-is-conditional-comment — check if node is a conditional comment
- hast-util-is-css-link — check if node is a CSS link
- hast-util-is-css-style — check if node is a CSS style
- hast-util-is-element — check if node is a (certain) element
- hast-util-is-event-handler — check if property is an event handler
- hast-util-is-javascript — check if node is a JavaScript script
- hast-util-labelable — check if node is labelable
- hast-util-menu-state — check the state of a menu element
- hast-util-parse-selector — create an element from a simple CSS selector
- hast-util-phrasing — check if a node is phrasing content
- hast-util-raw — parse a tree again
- hast-util-reading-time — estimate the reading time
- hast-util-sanitize — sanitize nodes
- hast-util-script-supporting — check if node is script-supporting content
- hast-util-select — querySelector, querySelectorAll, and matches
- hast-util-sectioning — check if node is sectioning content
- hast-util-shift-heading — change heading rank (depth, level)
- hast-util-table-cell-style — transform deprecated styling attributes on table cells to inline styles
- hast-util-to-dom — transform to a DOM tree
- hast-util-to-estree — transform to estree (JavaScript AST) JSX
- hast-util-to-html — serialize as HTML
- hast-util-to-jsx — transform hast to JSX
- hast-util-to-mdast — transform to mdast (markdown)
- hast-util-to-nlcst — transform to nlcst (natural language)
- hast-util-to-parse5 — transform to Parse5’s AST
- hast-util-to-portable-text — transform to portable text
- hast-util-to-string — get the plain-text value of a node (textContent)
- hast-util-to-text — get the plain-text value of a node (innerText)
- hast-util-to-xast — transform to xast (xml)
- hast-util-transparent — check if node is transparent content
- hast-util-truncate — truncate the tree to a certain number of characters
- hast-util-whitespace — check if node is inter-element whitespace
Related HTML utilities
- a-rel — List of link types for rel on a/area
- aria-attributes — List of ARIA attributes
- collapse-white-space — Replace multiple white-space characters with a single space
- comma-separated-tokens — Parse/stringify comma separated tokens
- html-tag-names — List of HTML tag names
- html-dangerous-encodings — List of dangerous HTML character encoding labels
- html-encodings — List of HTML character encoding labels
- html-element-attributes — Map of HTML attributes
- html-event-attributes — List of HTML event handler content attributes
- html-void-elements — List of void HTML tag names
- link-rel — List of link types for rel on link
- mathml-tag-names — List of MathML tag names
- meta-name — List of values for name on meta
- property-information — Information on HTML properties
- space-separated-tokens — Parse/stringify space separated tokens
- svg-tag-names — List of SVG tag names
- svg-element-attributes — Map of SVG attributes
- svg-event-attributes — List of SVG event handler content attributes
- web-namespaces — Map of web namespaces
References
- unist: Universal Syntax Tree. T. Wormer; et al.
- JavaScript: ECMAScript Language Specification. Ecma International.
- HTML: HTML Standard, A. van Kesteren; et al. WHATWG.
- DOM: DOM Standard, A. van Kesteren, A. Gregor, Ms2ger. WHATWG.
- SVG: Scalable Vector Graphics (SVG), N. Andronikos, R. Atanassov, T. Bah, B. Birtles, B. Brinza, C. Concolato, E. Dahlström, C. Lilley, C. McCormack, D. Schepers, R. Schwerdtfeger, D. Storey, S. Takagi, J. Watt. W3C.
- MathML: Mathematical Markup Language Standard, D. Carlisle, P. Ion, R. Miner. W3C.
- ARIA: Accessible Rich Internet Applications (WAI-ARIA), J. Diggs, J. Craig, S. McCarron, M. Cooper. W3C.
- JSON: The JavaScript Object Notation (JSON) Data Interchange Format, T. Bray. IETF.
- Web IDL: Web IDL, C. McCormack. W3C.
Security
As hast represents HTML, and improper use of HTML can open you up to a cross-site scripting (XSS) attack, improper use of hast is also unsafe.
Always be careful with user input and use hast-util-sanitize to make the hast tree safe.
Related
- mdast — Markdown Abstract Syntax Tree format
- nlcst — Natural Language Concrete Syntax Tree format
- xast — Extensible Abstract Syntax Tree
Contribute
See contributing.md in syntax-tree/.github for ways to get started.
See support.md for ways to get help.
Ideas for new utilities and tools can be posted in syntax-tree/ideas.
A curated list of awesome syntax-tree, unist, mdast, hast, xast, and nlcst resources can be found in awesome syntax-tree.
This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.
Acknowledgments
The initial release of this project was authored by @wooorm.
Special thanks to @eush77 for their work, ideas, and incredibly valuable feedback!
Thanks to @andrewburgess, @arobase-che, @arystan-sw, @BarryThePenguin, @brechtcs, @ChristianMurphy, @ChristopherBiscardi, @craftzdog, @cupojoe, @davidtheclark, @derhuerst, @detj, @DxCx, @erquhart, @flurmbo, @Hamms, @Hypercubed, @inklesspen, @jeffal, @jlevy, @Justineo, @lfittl, @kgryte, @kmck, @kthjm, @KyleAMathews, @macklinu, @medfreeman, @Murderlon, @nevik, @nokome, @phiresky, @revolunet, @rhysd, @Rokt33r, @rubys, @s1n, @Sarah-Seo, @sethvincent, @simov, @StarpTech, @stefanprobst, @stuff, @subhero24, @tripodsan, @tunnckoCore, @vhf, @voischev, and @zjaml, for contributing to hast and related projects!