org-parser
What should the final transformed parse structure look like?
Hi @branch14 @munen ,
According to the worg spec, the (transformed) parse tree should look like this:
(document
  (section)
  (headline
    (section)
    (headline)
    (headline
      (headline))))
As far as I remember, this is different from the Organice parser:
- The Organice parser keeps a flat list of headlines
- The Organice parser does not allow a section (aka "content") above the first headline
I suggest that we stick to the orgmode spec, i.e. allow a section above the first headline and keep a hierarchical structure of headlines. document would then be our S symbol.
That could be implemented in the transformers in PR #27.
It will be more work later to integrate org-parser into Organice, but we get a general orgmode parser :-)
On the other hand, a flat list of headlines and sections seems to be very pragmatic. It makes it easier to change headline level and order.
[[:section]
 [:headline]
 [:section]
 [:headline]
 …
]
How to keep headline and section together here? This is not pragmatic. Section should belong to the headline.
How about this?
[[:section]
 [:headline
  [:section]]
 [:headline
  [:section]]
 …
]
A couple of months ago, I had a look with @munen at what organice expects as a data structure.
From what I recall and based on the discussion in #27 I want to suggest the following (at least for depth 1 and 2):
{;; "In-buffer Settings", see https://orgmode.org/manual/In_002dbuffer-Settings.html
 :settings ...
 ;; Let's call text before the first headline the preamble. As each headline introduces a
 ;; new section the content before the first headline is a section that does not belong
 ;; to any headline.
 :preamble
 {:section {:raw ...
            :ast ...}}
 ;; a flat list of headlines with their associated sections
 :headlines
 [{:headline {:level 1
              :title "hello world"
              ...}
   :section
   {:raw "this is the first section\nthis line has *bold text*\n"
    :ast [[:text [:text-normal "this is the first section"]]
          [:text
           [:text-normal "this line has "]
           [:text-styled
            [:text-sty-bold [:text-inside-sty-normal "bold text"]]]]]}}
  ...]}
branch14 and I just double checked this suggestion. It looks fine to me.
As for the 'flat' vs 'nested' structure of headlines: We think that a flat list is easier for the consumer to work with. For those who need or want a nested structure, transforming from flat to nested is a simple reduce, so it shouldn't make a big difference which of the two org-parser actually provides.
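To make the "simple reduce" concrete, here is a minimal sketch of how a consumer could turn the flat list into a nested one. It assumes the entry shape proposed above (`{:headline {:level ...} :section ...}`); the function name `nest-headlines` and the `:children` key are made up for illustration and are not part of org-parser.

```clojure
(defn nest-headlines
  "Turns a flat vector of headline entries into a tree. An entry whose level
  is deeper than the last entry at some depth is appended to that entry's
  :children (hypothetical key). Entries without deeper followers get no
  :children key at all."
  [headlines]
  (letfn [(level [h] (get-in h [:headline :level]))
          (insert [siblings h]
            (let [parent (peek siblings)]
              (if (and parent (< (level parent) (level h)))
                ;; h is deeper than the last sibling: descend into that sibling
                (conj (pop siblings)
                      (update parent :children (fnil insert []) h))
                ;; otherwise h becomes a new sibling at this depth
                (conj siblings h))))]
    (reduce insert [] headlines)))

;; Usage: levels 1, 2, 3, 2, 1 yield two top-level entries.
(nest-headlines
 [{:headline {:level 1 :title "a"}}
  {:headline {:level 2 :title "b"}}
  {:headline {:level 3 :title "c"}}
  {:headline {:level 2 :title "d"}}
  {:headline {:level 1 :title "e"}}])
```

A headline that skips a level (e.g. level 3 directly after level 1) simply ends up as a child of the nearest shallower headline in this sketch.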
I agree. A flat list for headlines is fine. And having the same structure (:section) in the preamble is good.
It's possible that, for some element transformations, we should keep the transformed "sub-ast" and a "sub-raw" form. Or maybe better a pair of indexes pointing to the position in the section raw string? Anyway, it might make sense in some cases, to allow re-export without discarding whitespace.
In order to preserve whitespace we should either (a) include whitespace in the parsed text or (b) retain whitespace in the AST as we do with empty lines. (a) is how it is currently done.
Example input: *bold text* text
Example ast (a): [:text [:text-sty-bold "bold text"] [:text-normal " text"]]
Example ast (b): [:text [:text-sty-bold "bold text"] [:whitespace " "] [:text-normal "text"]]
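As a quick check that both variants keep the same information, the following sketch concatenates the string leaves of each example AST. `leaf-text` is a hypothetical helper, not something org-parser provides, and the markup characters (`*`) are not part of the leaves, so this compares only the text content, not the full raw input.

```clojure
(require '[clojure.string :as str])

(defn leaf-text
  "Concatenates all string leaves of a hiccup-style node in document order."
  [node]
  (if (string? node)
    node
    (str/join (map leaf-text (rest node)))))

(def ast-a [:text [:text-sty-bold "bold text"] [:text-normal " text"]])
(def ast-b [:text [:text-sty-bold "bold text"] [:whitespace " "] [:text-normal "text"]])

;; Both variants carry the separating space, just in different places.
(= (leaf-text ast-a) (leaf-text ast-b) "bold text text") ; => true
```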
@schoettl Do you have examples for "discarding" whitespace?
Passing raw for some elements is IMHO a convenience for consumers that cannot handle all elements, but it will be tricky to balance, as we cannot account for future use cases.
If you search the EBNF for the regex \bs\b, you will find examples where whitespace is parsed but not stored in the AST. It's mostly leading or trailing whitespace. Leading whitespace can often be discarded and re-computed for export/rendering. Trailing whitespace can often be omitted.
- list-item-line currently has problems because leading whitespace is discarded, but it's important in nested lists.
- block-begin-line discards whitespace between the begin marker and arguments
- headline discards whitespace in between its components
- clock lines discard whitespace in between their components
I've gone through it, and I think that only list-item-line must be fixed. For the rest, we can discard the whitespace or just re-compute it for export. Verbatim blocks are already parsed verbatim, i.e. not discarding trailing whitespace.
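To illustrate why the leading whitespace of a list item line cannot be dropped, here is a small sketch that reads the indentation from raw lines. The function `list-item-indent` and its regex are simplified assumptions for illustration (only `-`, `+`, `1.` and `1)` bullets, spaces only); they are not taken from the org-parser grammar.

```clojure
(defn list-item-indent
  "Returns the number of leading spaces of a list item line,
  or nil if the line does not look like a list item."
  [line]
  (when-let [[_ indent] (re-matches #"( *)(?:[-+]|\d+[.)]) .*" line)]
    (count indent)))

;; The same item text nests differently depending on its indentation,
;; so the indentation has to survive parsing.
(map list-item-indent ["- fruit" "  - apple" "  - pear" "- vegetables"])
;; => (0 2 2 0)
```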
Passing raw for some elements is IMHO a convenience for consumers that cannot handle all elements, but it will be tricky to balance, as we cannot account for future use cases.
Maybe the instaparse meta information about position/span can still be used in the resulting transformed structure? Then we don't need any additional raw values and can still provide all original information.
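A rough sketch of that idea in plain Clojure, without depending on instaparse: if a transformed node still carries its start and end index as metadata, the raw text can be recovered with `subs` on the section's raw string. The metadata keys `:start` and `:end` and the helper `raw-of` are assumptions for illustration. In instaparse itself the span of a parse tree node is available via `instaparse.core/span`; whether that metadata survives our transformers would need to be checked.

```clojure
(def raw "*bold text* text")

;; Hand-built stand-in for a transformed node that kept its span as metadata.
(def bold-node
  (with-meta [:text-sty-bold "bold text"] {:start 0 :end 11}))

(defn raw-of
  "Returns the slice of raw covered by node, or nil if node has no span."
  [raw node]
  (let [{:keys [start end]} (meta node)]
    (when (and start end)
      (subs raw start end))))

(raw-of raw bold-node) ; => "*bold text*"
```

Since metadata does not affect equality, consumers that ignore it would see the same AST as before.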