Bidirectional parsing with sgml
I think it would be useful to have bidirectional parsing of HTML/XML, something like:
?- load_html("<code>(foo (+ 2 3))</code>", X, []), load_html(C, X, []).
error(instantiation_error,load_html/3).
The use case would be to build or transform HTML/XML as a term before serializing it to a string, for example to serve it from a web server.
Currently the library also inserts tags that didn't exist in the string, so that might need to be addressed as part of it:
?- load_html("<code>(foo (+ 2 3))</code>", X, []).
X = [element(html,[],[element(head,[],[]),element(body,[],[element(code,[],["(foo (+ 2 3 ..."])])])].
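If HTML parsing is needed anyway, the wrapper elements shown above can be stripped by pattern matching on the DOM. A minimal sketch, assuming the result always has the html/head/body shape shown above; html_fragment/2 is a hypothetical helper name, not part of library(sgml):

```prolog
:- use_module(library(sgml)).
:- use_module(library(lists)).

% html_fragment(+Chars, -Nodes): hypothetical helper.
% Parses Chars as HTML and relates Nodes to the children of the
% body element, dropping the html/head/body wrapper.
% Assumption: the DOM is a single html element, as in the output above.
html_fragment(Chars, Nodes) :-
        load_html(Chars, [element(html, _, Children)], []),
        member(element(body, _, Nodes), Children).
```

With this, the query ?- html_fragment("<code>(foo (+ 2 3))</code>", Nodes). should bind Nodes to a list that contains only the code element.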
I think one way to avoid the unwanted tags is to use load_xml/3 instead of load_html/3. For example, we get:
?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []).
DOM = [element(code,[x="123",b="cde"],["(foo (+ 2 3))"])].
We can use a DCG to relate such a DOM representation to a list of characters:
:- use_module(library(sgml)).
:- use_module(library(dcgs)).
:- use_module(library(format)).

% elements_string(Es)//0 describes the characters of a list of DOM nodes.
elements_string([]) --> [].
elements_string([E|Es]) -->
        element_string(E),
        elements_string(Es).

% A text node is a non-empty list of characters, emitted verbatim.
element_string([C|Cs]) --> seq([C|Cs]).
% An element is an opening tag, its children, and a closing tag.
element_string(element(Name, Attrs, Cs)) -->
        format_("<~w", [Name]),
        attributes(Attrs),
        ">\n",
        elements_string(Cs),
        format_("~n</~w>~n", [Name]).

% Each attribute is preceded by a single space.
attributes([]) --> [].
attributes([A|As]) --> " ", attributes_([A|As]).

attributes_([]) --> [].
attributes_([Name=Value|As]) -->
        format_("~w=\"~s\"", [Name,Value]),
        attributes(As).
Yielding:
?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []),
   phrase(elements_string(DOM), Cs).
   DOM = [element(code,[x="123",b="cde"],["(foo (+ 2 3))"])],
   Cs = "<code x=\"123\" b=\"cde\">\n(foo (+ 2 3))\n</code>\n".
Emitting it with format/2 yields:
?- load_xml("<code x=\"123\" b=\"cde\">(foo (+ 2 3))</code>", DOM, []),
   phrase(elements_string(DOM), Cs),
   format("~s", [Cs]).
<code x="123" b="cde">
(foo (+ 2 3))
</code>
...
Does this help?
Note that load_html/3 and load_xml/3 support several different sources in addition to lists of characters, so a reverse direction that converts a DOM only to a list of characters would be incomplete.
Yeah, thanks! That's really helpful.
Is there a way to run the DCG example in the reverse direction, something like phrase(elements_string(DOM), "<code x=\"123\" b=\"cde\">\n(foo (+ 2 3))\n</code>\n")?
Sorry if it's a dumb question, I admittedly need to brush up on DCGs. I thought they were bidirectional because they're syntactic sugar for regular relations, but the query just produced lists of single-character strings, seemingly in an infinite loop. I'm asking because I might need to port it to Tau Prolog for the frontend, and the sgml library uses Rust internals.
Unrelated: I really like your Power of Prolog series on your site and YouTube. I love Prolog, and it inspires me to work with it a lot more.
Thank you, thank you, I am glad you find the material useful!
Parsing HTML is harder than generating it from the DOM, so Tau Prolog may benefit from similar engine-powered facilities to easily parse HTML. An alternative may be to use the newly available WASM port of Scryer Prolog for the frontend. Please see https://github.com/mthom/scryer-prolog/discussions/2005 for more information!
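On running a DCG in reverse: DCGs do describe relations, but the generator earlier in this thread is written with the generating direction in mind. Its text clause accepts any non-empty run of characters and, as far as I can tell, format_//2 formats its arguments rather than parsing into them, so the reverse query can only enumerate ways of splitting the input into text chunks. For a restricted subset, a grammar can be written so that both directions work. The following is only a sketch under strong assumptions: it uses a hypothetical simplified DOM (text(Chars) and element(Name, Children), with Name a list of characters instead of an atom), and it supports no attributes, entities, comments, or whitespace handling. All nonterminal names are made up for illustration.

```prolog
:- use_module(library(dcgs)).
:- use_module(library(dif)).
:- use_module(library(lists)).

% Hypothetical simplified DOM (not the one library(sgml) produces):
%   text(Chars)              non-empty text that contains no '<'
%   element(Name, Children)  Name is a list of characters, no attributes

nodes([]) --> [].
nodes([text(Cs)|Ns]) --> text(Cs), nodes_after_text(Ns).
nodes([element(Name, Children)|Ns]) -->
        element(Name, Children),
        nodes(Ns).

% Two text nodes may not be adjacent, so a string has at most one parse.
nodes_after_text([]) --> [].
nodes_after_text([element(Name, Children)|Ns]) -->
        element(Name, Children),
        nodes(Ns).

% The same Name must appear in the opening and the closing tag.
element(Name, Children) -->
        "<", tag_name(Name), ">",
        nodes(Children),
        "</", tag_name(Name), ">".

tag_name([C|Cs]) --> name_char(C), name_chars(Cs).

name_chars([]) --> [].
name_chars([C|Cs]) --> name_char(C), name_chars(Cs).

% dif/2 keeps the description usable whether or not C is known yet.
name_char(C) --> [C], { maplist(dif(C), "<>/ ") }.

text([C|Cs]) --> text_char(C), text_chars(Cs).

text_chars([]) --> [].
text_chars([C|Cs]) --> text_char(C), text_chars(Cs).

text_char(C) --> [C], { maplist(dif(C), "<") }.
```

With these definitions, ?- phrase(nodes(DOM), "<code>(foo (+ 2 3))</code>"). should yield DOM = [element("code",[text("(foo (+ 2 3))")])], and ?- phrase(nodes([element("code",[text("(foo (+ 2 3))")])]), Cs). should yield Cs = "<code>(foo (+ 2 3))</code>". Every nonterminal consumes at least one character before recursing, which is why parsing a finite list terminates. Covering real HTML this way would take considerably more work, which is the point made above about parsing being the harder direction.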