# DSL and a framework for building DSLs upon other languages

## Project description

### Background
Let's call a formal language a *child language* (please correct me if I'm wrong; I have not found any established terminology for what I am going to discuss here) if its parse tree is specified in terms of relations between AST nodes of another formal *parent language*. So, to parse such a child language, you first parse its parent language and then process the resulting AST.
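For example, here is a minimal sketch of that two-stage process in Python (the `Record` node shape and the `to_child_ast` function are invented for illustration, not part of any of the projects below):

```python
import json

# Stage 1: parse the parent language (JSON) with a stock parser.
doc = json.loads('{"type": "record", "fields": [{"name": "x"}]}')

# Stage 2: "parse" the child language by interpreting the parent's AST.
# The child-language rule here is: a dict whose "type" key is "record"
# denotes a Record node (this rule is hypothetical).
def to_child_ast(node):
    if isinstance(node, dict) and node.get("type") == "record":
        return ("Record", [f["name"] for f in node.get("fields", [])])
    raise ValueError("not a child-language document")

print(to_child_ast(doc))  # ('Record', ['x'])
```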
There are some live examples (usually DSLs) of such languages:
- KaitaiStruct -> JSON (de facto, YAML is used)
- UniGrammar -> JSON
- the languages used in Yandex Tomita-parser -> protobuf
- XGBoost trees serialized into JSON -> JSON
- various configs for various software -> JSON, TOML, ini ... (should I compile a list of them?)
As you can guess, the hierarchy can be arbitrarily deep (though having more than 2 levels is unlikely to be practical because of the overhead).
The benefit of such languages is that they inherit the ecosystem of existing tools and libraries for their parent languages.
### Problems
- Usually such child languages have hand-coded parsers instead of ones generated from formal descriptions called grammars.
- Usually they lack tooling providing syntax highlighting, autocomplete and search. These features are usually implemented in language servers following the LSP standard.
### Solution
This proposed solution is inspired by my experience with child languages based on JSON.
- create a DSL describing child grammars as graph-node features.
    - Usual grammars describe languages in terms of tokens: their parent language is tokenized plain text, and its AST nodes are the tokens. Tokens themselves are described in terms of characters of a certain class plus repetition operations, or sometimes regexps.
    - But the input of both parsers and tokenizers is perfectly ordered and flat; it is impossible to rely on hierarchical features to create child grammars, so they rely on types and their order. Our situation is different: we don't extract any information from order; instead we extract information from subgraphs.
    - there are the following kinds of parent-language AST nodes: `terminal`s, `sequence`s and `object`s. `terminal`s correspond to strings, numbers and nulls in JSON. `sequence`s correspond to repeats of `terminal`s and `object`s. `object`s are nodes having multiple subnodes, each of a certain type and with a certain `id`; the `id` describes the semantics of that subnode. This info is extracted from a formal grammar of a language written in UniGrammar or our language.
    - then we specify productions as a sequence. Each consists of a matcher and an action. If a matcher matches, the action is applied and no further rules are checked.
- encapsulate the generation of code related to these features into separate objects called *backends*
- generate a parser
- specify relations between AST nodes and their possible values in a special DSL.
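To illustrate the matcher/action machinery, a minimal sketch in Python (all names and the tuple-based AST are invented for illustration): productions are tried in order, and the first matcher that fires applies its action; no further rules are checked.

```python
# Each production is a (matcher, action) pair, tried in order.
# The first matcher that returns True wins; its action runs and
# no further productions are checked (names are hypothetical).
productions = [
    (lambda n: isinstance(n, dict) and "id" in n,
     lambda n: ("Name", n["id"])),
    (lambda n: isinstance(n, dict) and "ref" in n,
     lambda n: ("Ref", n["ref"])),
    (lambda n: isinstance(n, str),
     lambda n: ("Terminal", n)),
]

def apply_productions(node):
    for matches, action in productions:
        if matches(node):
            return action(node)
    raise ValueError(f"no production matched {node!r}")

print(apply_productions({"id": "a"}))   # ('Name', 'a')
print(apply_productions({"ref": "b"}))  # ('Ref', 'b')
```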
### DSL
It is clear that the DSL should be similar to JSON Schema, but it cannot simply extend it, because it is not only about JSON; it should reason in terms of AST nodes, not types.
## Relevant Technology
- jsonschema
- LSP
## Complexity and required time

### Complexity
- [x] Intermediate - The user should have some prior knowledge of the technolog(y|ies) to the point where they know how to use it, but not necessarily all the nooks and crannies of the technology
### Required time (ETA)
- [x] Much work - The project will take more than a couple of weeks and serious planning is required
Yes, they surely can be utilised; there is just no framework for that in them. I mean that not all the languages above JSON/YAML can be specified as a JSON Schema; we need completions with custom code.
If we have a JSONSchema, we need to add to each property a function computing the lookup, then in a langserver check for that function, and if it is present, use it, passing it all the needed info.
I am not sure what is the best way to achieve that because the new schema should be readable from all the languages, so no functions.
One variant is to encode not functions in the JS sense, but to implement a DSL above YAML/JSON describing the ways to look up the suggestions (usually we want either the content of another property matching some criteria (which can also be encoded in YAML), or a path in some subtree (with adjustable delimiters), or a YAML ref of a specific type, or ...).
Then we can either transform that DSL into target language or interpret it on the fly.
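A minimal sketch of the "interpret it on the fly" variant (the rule encoding here is invented for illustration, not a proposed final format): a completion rule is stored as plain data in the schema, and a tiny interpreter evaluates it against the parsed document.

```python
# A completion rule encoded as plain data: "collect the values of
# another property" is expressed declaratively (this rule format
# is hypothetical).
rule = {"kind": "property-values", "property": "id"}

def lookup(rule, document):
    """Interpret a declarative completion rule against a parsed document."""
    if rule["kind"] == "property-values":
        prop = rule["property"]
        out = []
        def walk(node):
            if isinstance(node, dict):
                if prop in node:
                    out.append(node[prop])
                for v in node.values():
                    walk(v)
            elif isinstance(node, list):
                for v in node:
                    walk(v)
        walk(document)
        return out
    raise ValueError("unknown rule kind")

doc = {"prods": [{"id": "a", "ref": "b"}, {"id": "b"}]}
print(lookup(rule, doc))  # ['a', 'b']
```

The same rule data could instead be compiled into target-language code; the interpreter is just the simpler option to prototype.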
@ericprud, how much is your jsg aligned with this idea? Does it allow me to rewrite the parser https://github.com/UniGrammar/UniGrammar.py/tree/master/UniGrammar/ownGrammarFormat as a jsongrammar (yes, I know that jsg is in JS; if needed, it can be rewritten into another language) and get generated code doing the same (it may have its own objects for AST nodes, but it must do all the parsing (such as splitting a single dict into multiple AST nodes, nested in each other in the correct order) and validation itself)?
Let's see if I understand well enough to answer.
If you have an abstract syntax (AS), you can create a greenfield DSL to express it and/or take some generic format (XML, RDF, JSON) and use a format-specific schema language to constrain the structures to those that are sensible in your AS.
For the greenfield DSL, you would specify it in terms of a grammar that defines the set of permissible documents, tied to AS structures (like the beige boxes in http://shex.io/shex-semantics/#shexc) and execute it with a home-grown parser or a parser generator output (yacc, Antlr...).
For the generic formats, your schema language would ideally be expressive enough to adequately define the permissible documents. (http://shex.io/shex-semantics/#shexj performs well in that regard.) If the mapping from the parsed data structure to the AS isn't obvious, you'd probably want to add some glue like the aforementioned beige boxes. I don't know that I've seen that. There are certainly plenty of examples where the mapping is assumed to be obvious, like the mapping from FHIR/XML, FHIR/JSON or FHIR/RDF to the documented model.
When you say "DSLs upon other languages", I think you're talking about what I called the format-specific schema (FSS). For "DSL" (all by itself), I think you mean either a meta language that can be translated into an FSS, or an anointed FSS that can be translated to others. In either case, some formats will require additional constraints to e.g. decide between elements and attributes in XML. You may also want to capture graph references (href->id, IDREF->ID) in tree languages like XML or JSON. ShExJ doesn't do that. If you write a broken internal reference, JSG won't notice. (IIRC, DTDs offered that validation for XML, and I don't know if XML Schema or RelaxNG validators do.)
My py-fu is too weak for me to really interpret ownGrammarFormat as a grammar, but I see little hints that give me some sense of what a BNF for the AS would look like. I'd like to think that a parser-generator could consume some JSG and emit that Python, but there are too many unknown unknowns for me to call that a prediction.
Aside: I've always thought it would be cool to have a parser construction kit, where you can reference productions in other grammars to build e.g. a native SVG parser rather than an SVG schema that operated over the output of a native XML parser. I'd hoped to prototype that in the yacker a couple decades ago, but never got to it.
I don't quite understand what you meant at the beginning of the text. I am not very familiar with RDF and ShEx, but from what I have read it seems like a kind of language for logic programming, for creating databases of facts. I haven't grasped the terminology used there.
> I think you mean either a meta language that can be translated into an FSS, or an anointed FSS that can be translated to others. In either case, some formats will require additional constraints to e.g. decide between elements and attributes in XML. You may also want to capture graph references (href->id, IDREF->ID) in tree languages like XML or JSON.

I don't fully understand what you mean, and taking into account

> My py-fu is too weak for me to really interpret ownGrammarFormat as a grammar but I see little hints that give me some sense of what a BNF for the AS would look like.

let me give you an example. UniGrammar is a transpiler of EBNF-like grammars between different parser generators. The grammars are essentially the same for the different parser generators; only some nuances differ. UG has its own DSL for grammars, but for the syntax we use YAML. Not quite YAML: JSON would fit too, since when YAML is parsed, it results in the same structure as JSON.
Here is the example:
```yaml
prods:
  - id: a   # this production's name is `a`
    ref: b  # the terminal or nonterminal named `b` is used
    min: 1  # one or more repetitions; in regexp syntax it is `+`
    cap: A  # make this element available in the parsed AST under the name `A`
```
This item is "flat": `id`, `min` and `ref` are siblings in the YAML AST, which can look like the following in pseudocode:
```
Dictionary(elements=[
  DictElement(key=String("prods"), value=Array([
    Dictionary(elements=[
      DictElement(key=String("id"), value=String("a")),
      DictElement(key=String("ref"), value=String("b")),
      DictElement(key=String("min"), value=Number(1)),
      DictElement(key=String("cap"), value=String("A")),
    ]),
  ])),
])
```
But our language's AST should be the following:

```
Grammar(
  productions=Productions([
    Name("a",
      Cap("A",
        Iter(
          min = 1,
          child = Ref(child=<here a reference/pointer to the object with `id: b`>)
        )
      )
    )
  ])
)
```
Here the types of nodes are not the same as in the YAML AST. In the YAML AST, node types say only what is written in the YAML, but in our AST the node types are semantic and domain-specific. The reference to `b` is recognized, looked up in the index and populated. Also, in the YAML AST the nodes within the dict are all at the same level, but in our AST there are multiple nested nodes. And the way they are nested cannot be changed: swapping the nesting produces nonsense. So the parser has to correctly determine which syntactic features belong to which node and in which order the nodes must be nested. In my handcoded parser this is done via a series of functions applied to each dict, and the order in which the functions are applied determines which type of node must be closer to the root of the YAML AST subtree.
But I don't like handcoded parsers, they are not easy to read and not easy to port.
Basically, my grammar above YAML can be described in natural language by the following statements:

- the root must map to a `Grammar` AST node.
- a `Grammar` AST node is made from a `Dictionary` of the YAML AST
- the dict may contain the key `prods`, which maps to our property `productions` of type `Productions`
- `Productions` is itself a collection, made from a YAML array, each element of which must be either a `Name` or any other node allowed there, such as a `Spacer`
- When an element is made from a YAML `Dict` having a `DictElement` with the key `String('id')` whose value is of type `String`, it is a `Name`, whose `id` must be set to the value of the `DictElement`; the rest of the `Dict` is further processed and the result is set to `child`.
- When an element is made from a YAML `Dict` having a `DictElement` with the key `String('cap')` whose value is of type `String`, it is a `Cap`, whose `name` must be set to the value of the `DictElement`; the rest of the `Dict` is further processed and the result is set to `child`.
- When an element is made from a YAML `Dict` having a `DictElement` with the key `String('min')` whose value is of type `Number` and is greater than or equal to `0`, it is an `Iter`, whose `min` must be set to the value of the `DictElement`; the rest of the `Dict` is further processed and the result is set to `child`.
- When an element is made from a YAML `Dict` having a `DictElement` with the key `String('ref')` whose value is of type `String`, it is a `Ref`, whose `child` must be set to the previously created node of type `Name` whose `id` is equal to the `value` of the `DictElement`.
- each `Iter` must be encapsulated into its own `Name`, or into an element that must itself be encapsulated into its own `Name`
- each `Cap` must be encapsulated into its own `Name`
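Such statements could be encoded as plain data and executed by a generic transformer. A minimal sketch (the rule encoding and the tuple-based AST are invented for illustration), covering only the nesting order of the four key rules:

```python
# Rules as data: each rule says which dict key it consumes, the AST
# node type it creates, and its distance from the root (lower rank =
# closer to the root). This encoding is hypothetical.
RULES = [
    {"key": "id",  "node": "Name", "rank": 0},
    {"key": "cap", "node": "Cap",  "rank": 1},
    {"key": "min", "node": "Iter", "rank": 2},
    {"key": "ref", "node": "Ref",  "rank": 3},
]

def transform(d):
    """Build the nested AST one rule at a time, leaf-most rule first."""
    node = None
    for rule in sorted(RULES, key=lambda r: r["rank"], reverse=True):
        if rule["key"] in d:
            node = (rule["node"], d[rule["key"]], node)
    return node

print(transform({"id": "a", "ref": "b", "min": 1, "cap": "A"}))
# ('Name', 'a', ('Cap', 'A', ('Iter', 1, ('Ref', 'b', None))))
```

The point is that the nesting order lives in data, not in hand-written control flow, so the same rules could also drive validation and a language server.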
Basically, the idea described in this issue is to have:

- a DSL allowing one to specify such constraints easily and declaratively
- a compiler that would generate for me the code
    - validating the result of YAML parsing
    - transforming the result of YAML parsing into my AST
    - reporting errors, if there are any
- a language server that would assist me when editing YAML. The constraint that the value of `ref` must be a previously mentioned `id` means that when I activate autocompletion after `ref:`, the language server should make the editor suggest in the drop-down menu only the `id`s used prior to that item.
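A sketch of the completion logic such a language server could run (the data layout is hypothetical; a real implementation would sit behind an LSP `textDocument/completion` handler):

```python
# Given the parsed `prods` list and the index of the item whose
# `ref:` is being completed, suggest only the `id`s defined in
# earlier items (data layout is hypothetical).
def completions_for_ref(prods, current_index):
    return [p["id"] for p in prods[:current_index] if "id" in p]

prods = [
    {"id": "a", "ref": "b"},
    {"id": "b"},
    {"ref": ""},  # the cursor is here, after `ref:`
]
print(completions_for_ref(prods, 2))  # ['a', 'b']
```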
YAML is just an example. JSON-like markup languages, such as MsgPack, CBOR, UBJSON and Rison, have almost the same AST (almost, in that there may be multiple types for numbers, e.g. float and int instead of just Number). For other languages, like CSV, similar logic applies.